HPLT-edu classifiers
This is a classifier for judging the educational value of Kazakh (kaz-Cyrl) web pages. It was developed to filter educational content from HPLT v2 and was trained on 450k annotations generated by Llama-3.3-70B-Instruct. The web pages were sampled randomly from the Kazakh subset of the corpus.
To load the Llama-HPLT-Edu classifier, use the following code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-kaz-Cyrl")
model = AutoModelForSequenceClassification.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-kaz-Cyrl")

text = "I'm non-educational web page containing nothing useful"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)

# The model has a regression head: a single logit is the educational score.
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()
result = {
    "text": text,
    "score": score,
    # Clamp the raw score to the 0-5 range and round to an integer grade.
    "int_score": int(round(max(0, min(score, 5)))),
}
print(result)
# Example outputs from a model trained with Welsh annotations:
#{'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
#{'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
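To filter a larger collection of documents, the same scoring can be applied in batches. Below is a minimal sketch; the `docs` list, the batch size, and the threshold of 3 are illustrative assumptions, not part of the released pipeline:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-kaz-Cyrl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

docs = ["document one ...", "document two ..."]  # placeholder corpus
batch_size = 32  # illustrative; tune to your hardware
educational = []

with torch.no_grad():
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding="longest", truncation=True)
        scores = model(**inputs).logits.squeeze(-1).float().tolist()
        for doc, score in zip(batch, scores):
            # Keep documents whose clamped, rounded score reaches the assumed threshold.
            if int(round(max(0, min(score, 5)))) >= 3:
                educational.append(doc)

The threshold is a precision/recall trade-off: lowering it keeps more data at the cost of admitting less educational pages.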
Performance on a test set of 25,000 examples (educational scores 0-5):

              precision    recall  f1-score   support

           0       0.79      0.48      0.60      7390
           1       0.58      0.73      0.65      9460
           2       0.46      0.56      0.51      4245
           3       0.37      0.45      0.41      1972
           4       0.64      0.37      0.47      1546
           5       0.67      0.59      0.63       387

    accuracy                           0.58     25000
   macro avg       0.59      0.53      0.54     25000
weighted avg       0.61      0.58      0.58     25000
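The table above appears to follow the layout of scikit-learn's classification_report, comparing the classifier's rounded integer scores against the reference labels. A minimal sketch of how such a report can be produced, with placeholder data standing in for the real test set:

import numpy as np
from sklearn.metrics import classification_report

# Placeholder data: y_true would be the reference labels (0-5) and
# raw_scores the classifier's raw regression outputs on the test set.
y_true = np.array([0, 1, 2, 3, 4, 5])
raw_scores = np.array([0.3, 1.2, 2.6, 2.9, 4.4, 4.8])

# Clamp and round raw scores to integer grades, as in the usage example above.
y_pred = np.clip(np.round(raw_scores), 0, 5).astype(int)
print(classification_report(y_true, y_pred))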
A preprint is coming soon. If you need to cite this work in the meantime, please use the citation below:
@misc{llama_hplt_edu_classifiers_2025,
  author = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
  title = {Llama-HPLT-edu classifiers},
  year = {2025},
  url = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
  publisher = {Hugging Face}
}