---
language:
  - dan
license: apache-2.0
datasets:
  - LumiOpen/hpltv2-llama33-edu-annotation
---

Llama-HPLT-edu-Danish classifier

Model summary

This is a classifier for judging the educational content of Danish (dan-Latn) web pages. It was developed to filter educational content from HPLT v2 and was trained on 450k annotations generated by Llama-3.3-70B-Instruct. The web pages were sampled randomly from the Danish subset of the corpus.

How to load in transformers

To load the Llama-HPLT-Edu classifier, use the following code:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")
model = AutoModelForSequenceClassification.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")

text = "I'm non-educational web page containing nothing useful"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# The model has a single regression output: a continuous educational score,
# which is also rounded and clamped to an integer on the 0-5 scale.
score = outputs.logits.squeeze(-1).float().item()
result = {
    "text": text,
    "score": score,
    "int_score": int(round(max(0, min(score, 5)))),
}
print(result)
# Example outputs from a model trained with Welsh annotations:
#{'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
#{'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
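
For filtering a corpus it is more efficient to score pages in batches. Below is a minimal sketch that reuses the tokenizer and model loaded above; the example texts mirror the Welsh outputs, and any cut-off applied to int_score for keep/drop decisions is a downstream choice, not something prescribed by this card.

import torch

texts = [
    "I'm non-educational web page containing nothing useful",
    "what are most common animals found in farm? there are cows, sheeps",
]

# One forward pass scores the whole padded batch.
inputs = tokenizer(texts, return_tensors="pt", padding="longest", truncation=True)
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1).float()

for text, score in zip(texts, scores.tolist()):
    print({
        "text": text,
        "score": score,
        "int_score": int(round(max(0, min(score, 5)))),
    })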

Training

  • Model: FacebookAI/xlm-roberta-large with a classification head
  • Dataset: 500,000 samples annotated by Llama-3.3-70B-Instruct, split into 450,000 train, 25,000 validation, and 25,000 test examples.
  • Epochs: 20
  • Learning Rate: 3e-4
  • Evaluation Metric: F1 score
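
The usage example above reads a single continuous logit, which matches a sequence-regression head (num_labels=1). The following is a minimal sketch of how such a run could be set up with the Hugging Face Trainer; the dataset column and split names, the batch size, and macro averaging for F1 are illustrative assumptions, not details confirmed by this card.

import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

base = "FacebookAI/xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(base)
# num_labels=1 gives a single-output regression head, matching the
# continuous scores produced in the usage example.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

# Column names ("text", "score") and split names are assumed; check the
# dataset's actual schema before running.
dataset = load_dataset("LumiOpen/hpltv2-llama33-edu-annotation")

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True)
    enc["labels"] = [float(s) for s in batch["score"]]
    return enc

tokenized = dataset.map(preprocess, batched=True)

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # Evaluate on the rounded 0-5 integer scale, as in the test report.
    preds = np.clip(np.round(preds.squeeze(-1)), 0, 5)
    return {"f1": f1_score(np.round(labels), preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="xlmr-edu-dan",
        learning_rate=3e-4,
        num_train_epochs=20,
        per_device_train_batch_size=64,  # assumed; not stated in the card
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()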

Test Metrics

Classification report on the 25,000-example held-out test split; rows are the rounded integer scores (0-5):

              precision    recall  f1-score   support

           0       0.85      0.74      0.80       12079
           1       0.58      0.72      0.64       8509
           2       0.46      0.50      0.48       2811
           3       0.37      0.29      0.32       1004
           4       0.70      0.17      0.27       577
           5       0.10      0.05      0.07       20

    accuracy                           0.68      25000
   macro avg       0.51      0.41      0.43      25000
weighted avg       0.69      0.68      0.68      25000
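
The table above has the shape of scikit-learn's classification_report computed on the rounded integer scores. As a sketch, comparable numbers could be reproduced with a helper like the one below, where test_texts and test_labels stand in for the 25,000 held-out annotations (placeholders, not names from this repository).

import torch
from sklearn.metrics import classification_report

def predict_int_scores(texts, batch_size=32):
    """Score texts and round/clamp predictions to the 0-5 integer scale."""
    preds = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                          padding="longest", truncation=True)
        with torch.no_grad():
            scores = model(**batch).logits.squeeze(-1).float().tolist()
        preds.extend(int(round(max(0, min(s, 5)))) for s in scores)
    return preds

# print(classification_report(test_labels, predict_int_scores(test_texts)))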

Citing

Preprint coming soon. If you need to cite this work, please use the citation below:

@misc{llama_hplt_edu_classifiers_2025,
    author       = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
    title        = {Llama-HPLT-edu classifiers},
    year         = 2025,
    url          = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
    publisher    = {Hugging Face}
}