This classifier scores the educational content of Croatian (hrv-Latn) web pages. It was developed to filter educational content from HPLT v2 and was trained on 450k annotations generated by Llama-3.3-70B-Instruct. The annotated web pages were sampled randomly from the Croatian subset of the corpus.
To load the Llama-HPLT-Edu classifier, use the following code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-hrv-Latn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "I'm non-educational web page containing nothing useful"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)

# squeeze(-1) drops the trailing label dimension, leaving one score for the single input text.
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()

result = {
    "text": text,
    "score": score,
    # Clamp the raw score to [0, 5], then round to the nearest integer label.
    "int_score": int(round(max(0, min(score, 5)))),
}
print(result)
# Example results, produced by a model trained with Welsh annotations (scores from this model may differ):
# {'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
# {'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
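Since the classifier was built for filtering, the sketch below shows one way the raw score might be turned into a keep/drop decision. The clamping and rounding mirror the `int_score` expression in the snippet above; the `keep_document` helper and its threshold of 3 are hypothetical choices for illustration, not values taken from this model card.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw score to [0, 5] and round it, as in the loading snippet."""
    return int(round(max(0, min(score, 5))))


def keep_document(score: float, threshold: int = 3) -> bool:
    """Hypothetical filter rule: keep a document if its integer score reaches the threshold."""
    return to_int_score(score) >= threshold


# The first two raw scores are copied from the example results; the last two are made-up edge cases.
examples = [0.8145455718040466, 1.6858888864517212, -0.3, 7.2]
int_scores = [to_int_score(s) for s in examples]
kept = [s for s in examples if keep_document(s)]

print(int_scores)  # [1, 2, 0, 5]
print(kept)        # [7.2]
```

Note that Python's built-in `round` rounds halves to the nearest even integer, so a raw score of exactly 2.5 maps to 2 rather than 3.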
              precision    recall  f1-score   support

           0       0.84      0.64      0.73     10199
           1       0.56      0.71      0.62      8850
           2       0.45      0.57      0.50      3638
           3       0.38      0.32      0.35      1505
           4       0.72      0.19      0.30       788
           5       0.29      0.10      0.15        20

    accuracy                           0.62     25000
   macro avg       0.54      0.42      0.44     25000
weighted avg       0.65      0.62      0.62     25000
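To make the two summary rows easier to read, the sketch below recomputes them from the per-class rows of the report. It assumes the usual definitions: the macro average is the unweighted mean over the six classes, and the weighted average weights each class by its support. The inputs are the two-decimal values printed above, so this is a consistency check on the table and not a re-evaluation of the model.

```python
# Per-class rows copied from the report: label -> (precision, recall, f1, support)
rows = {
    0: (0.84, 0.64, 0.73, 10199),
    1: (0.56, 0.71, 0.62, 8850),
    2: (0.45, 0.57, 0.50, 3638),
    3: (0.38, 0.32, 0.35, 1505),
    4: (0.72, 0.19, 0.30, 788),
    5: (0.29, 0.10, 0.15, 20),
}

total_support = sum(r[3] for r in rows.values())


def macro_avg(metric_index: int) -> float:
    """Unweighted mean of one metric over all classes."""
    return sum(r[metric_index] for r in rows.values()) / len(rows)


def weighted_avg(metric_index: int) -> float:
    """Mean of one metric over all classes, weighted by class support."""
    return sum(r[metric_index] * r[3] for r in rows.values()) / total_support


macro = [round(macro_avg(i), 2) for i in range(3)]
weighted = [round(weighted_avg(i), 2) for i in range(3)]

print(total_support)  # 25000
print(macro)          # [0.54, 0.42, 0.44]
print(weighted)       # [0.65, 0.62, 0.62]
```

The gap between the macro and weighted rows comes from the class imbalance: classes 4 and 5 have low recall but together account for only 808 of the 25000 samples, so they pull the macro average down far more than the weighted one.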
Preprint coming soon. If you need to cite this work, please use the citation below:
@misc{llama_hplt_edu_classifiers_2025,
  author = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
  title = {Llama-HPLT-edu classifiers},
  year = 2025,
  url = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
  publisher = {Hugging Face}
}