HPLT-edu classifiers
This is a classifier for judging the educational value of Kazakh (kaz-Cyrl) web pages. It was developed to filter educational content from HPLT v2 and was trained on 450k annotations generated by Llama-3.3-70B-Instruct. The web pages were sampled randomly from the Kazakh subset of the corpus.
To load the Llama-HPLT-Edu classifier, use the following code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-kaz-Cyrl")
model = AutoModelForSequenceClassification.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-kaz-Cyrl")

text = "I'm non-educational web page containing nothing useful"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)

# The model has a regression head: a single logit is the educational score.
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()
result = {
    "text": text,
    "score": score,
    # Clamp the raw score to the 0-5 range and round to an integer grade.
    "int_score": int(round(max(0, min(score, 5)))),
}
print(result)
# Example outputs from a model trained with Welsh annotations:
#{'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
#{'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
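To filter a larger collection of documents, the same scoring can be applied in batches. Below is a minimal sketch; the `docs` list, the batch size, and the threshold of 3 are illustrative assumptions, not part of the released pipeline:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-kaz-Cyrl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

docs = ["document one ...", "document two ..."]  # placeholder corpus
batch_size = 32  # illustrative; tune to your hardware
educational = []

with torch.no_grad():
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding="longest", truncation=True)
        scores = model(**inputs).logits.squeeze(-1).float().tolist()
        for doc, score in zip(batch, scores):
            # Keep documents whose clamped, rounded score reaches the assumed threshold.
            if int(round(max(0, min(score, 5)))) >= 3:
                educational.append(doc)

The threshold is a precision/recall trade-off: lowering it keeps more data at the cost of admitting less educational pages.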
Performance on a test set of 25,000 examples (educational scores 0-5):

              precision    recall  f1-score   support

           0       0.79      0.48      0.60      7390
           1       0.58      0.73      0.65      9460
           2       0.46      0.56      0.51      4245
           3       0.37      0.45      0.41      1972
           4       0.64      0.37      0.47      1546
           5       0.67      0.59      0.63       387

    accuracy                           0.58     25000
   macro avg       0.59      0.53      0.54     25000
weighted avg       0.61      0.58      0.58     25000
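The table above appears to follow the layout of scikit-learn's classification_report, comparing the classifier's rounded integer scores against the reference labels. A minimal sketch of how such a report can be produced, with placeholder data standing in for the real test set:

import numpy as np
from sklearn.metrics import classification_report

# Placeholder data: y_true would be the reference labels (0-5) and
# raw_scores the classifier's raw regression outputs on the test set.
y_true = np.array([0, 1, 2, 3, 4, 5])
raw_scores = np.array([0.3, 1.2, 2.6, 2.9, 4.4, 4.8])

# Clamp and round raw scores to integer grades, as in the usage example above.
y_pred = np.clip(np.round(raw_scores), 0, 5).astype(int)
print(classification_report(y_true, y_pred))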
A preprint is coming soon. If you need to cite this work in the meantime, please use the citation below:
@misc{llama_hplt_edu_classifiers_2025,
  author = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
  title = {Llama-HPLT-edu classifiers},
  year = {2025},
  url = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
  publisher = {Hugging Face}
}