Llama-HPLT-edu-Danish classifier

Model summary

This is a classifier for judging the educational content of Danish (dan-Latn) web pages. It was developed to filter educational content from HPLT v2 and was trained on 450k annotations generated by Llama-3.3-70B-Instruct. The web pages were sampled randomly from the Danish subset of the corpus.

How to load in transformers

To load the Llama-HPLT-Edu classifier, use the following code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")
model = AutoModelForSequenceClassification.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")

text = "I'm non-educational web page containing nothing useful"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)
# The model has a single regression output; squeeze it to a scalar score.
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()
result = {
    "text": text,
    "score": score,
    # Clamp to the 0-5 annotation scale and round to the nearest integer grade.
    "int_score": int(round(max(0, min(score, 5)))),
}
print(result)
# Results from a model trained with Welsh annotations
#{'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
#{'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
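
For filtering a larger collection, scoring can be batched. The sketch below is illustrative only: the keep-threshold of 3 and the batching approach are assumptions, not recommendations from this card.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

texts = ["første dokument ...", "andet dokument ..."]  # documents to score

with torch.no_grad():
    inputs = tokenizer(texts, return_tensors="pt", padding="longest", truncation=True)
    scores = model(**inputs).logits.squeeze(-1).tolist()

# Keep documents whose clamped, rounded score reaches the (assumed) threshold of 3.
educational = [t for t, s in zip(texts, scores) if round(max(0, min(s, 5))) >= 3]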

Training

  • Model: FacebookAI/xlm-roberta-large with a classification head
  • Dataset: 500,000 samples annotated by Llama-3.3-70B-Instruct, split into 450,000 train, 25,000 validation, and 25,000 test examples.
  • Epochs: 20
  • Learning Rate: 3e-4
  • Evaluation Metric: F1 score
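
The training script is not included in this card. The following is a minimal sketch of one plausible setup, assuming the model is fine-tuned as a single-output regression over the 0-5 scores with the Hugging Face Trainer; the dataset path, column names, batch size, and other unlisted arguments are assumptions.

from datasets import load_dataset
from sklearn.metrics import f1_score
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Hypothetical dataset with "text" and "score" columns.
dataset = load_dataset("path/to/llama-annotations")

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/xlm-roberta-large",
    num_labels=1,                 # single regression output
    problem_type="regression",    # MSE loss over the 0-5 scores
)

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True)
    enc["labels"] = [float(s) for s in batch["score"]]
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.clip(np.round(preds.squeeze(-1)), 0, 5).astype(int)
    return {"f1_macro": f1_score(np.round(labels).astype(int), preds, average="macro")}

args = TrainingArguments(
    output_dir="llama-hplt-edu-classifier",
    learning_rate=3e-4,
    num_train_epochs=20,
    eval_strategy="epoch",
    per_device_train_batch_size=32,   # assumed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()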

Test Metrics


              precision    recall  f1-score   support

           0       0.85      0.74      0.80       12079
           1       0.58      0.72      0.64       8509
           2       0.46      0.50      0.48       2811
           3       0.37      0.29      0.32       1004
           4       0.70      0.17      0.27       577
           5       0.10      0.05      0.07       20

    accuracy                           0.68      25000
   macro avg       0.51      0.41      0.43      25000
weighted avg       0.69      0.68      0.68      25000
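
The report above is in scikit-learn's classification_report format. Assuming the raw regression outputs on the test split were clamped to the 0-5 scale and rounded before comparison with the integer Llama scores, it could be reproduced roughly as in the sketch below (the variable contents are placeholders, not the actual evaluation data).

import numpy as np
from sklearn.metrics import classification_report

# Placeholders: in practice these would be the model's raw scores on the 25,000
# test examples and the corresponding integer Llama annotations.
predicted_scores = np.array([0.3, 1.7, 2.1, 4.6])
reference_scores = np.array([0, 2, 2, 5])

pred_int = np.clip(np.round(predicted_scores), 0, 5).astype(int)
print(classification_report(reference_scores, pred_int, zero_division=0))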

Citing

Preprint coming soon. If you need to cite this work, please use the citation below:

@misc{llama_hplt_edu_classifiers_2025,
    author    = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
    title     = {Llama-HPLT-edu classifiers},
    year      = {2025},
    url       = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
    publisher = {Hugging Face}
}