---
language:
- dan
license: apache-2.0
datasets:
- LumiOpen/hpltv2-llama33-edu-annotation
---
Llama-HPLT-edu-Danish classifier
Model summary
This is a classifier for judging the educational content of Danish (dan-Latn) web pages. It was developed to filter educational content from HPLT v2 and was trained on 450k annotations generated by Llama-3.3-70B-Instruct. The web pages were sampled randomly from the Danish subset of the corpus.
How to load in transformers
To load the Llama-HPLT-Edu classifier, use the following code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")
model = AutoModelForSequenceClassification.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")
text = "I'm non-educational web page containing nothing useful"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()
result = {
"text": text,
"score": score,
"int_score": int(round(max(0, min(score, 5)))),
}
print(result)
# Example results from a model trained with Welsh annotations:
# {'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
# {'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
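For filtering a corpus, one option is to batch documents and keep those whose rounded score clears a threshold. The helper below is a sketch, not part of the released code: it reuses the tokenizer and model loaded above, and the >= 3 cut-off is an illustrative assumption rather than a recommendation from this card.

import torch

def score_texts(texts, batch_size=16):
    # Score a list of documents with the tokenizer and model loaded above.
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                          padding="longest", truncation=True)
        with torch.no_grad():
            logits = model(**batch).logits.squeeze(-1)
        scores.extend(logits.float().tolist())
    return scores

docs = [
    "Fotosyntese er den proces, hvor planter omdanner lys til energi.",  # educational-leaning
    "Klik her for at vinde en gratis telefon!",                          # spam-like
]
# Keep documents whose clamped, rounded score reaches the (assumed) threshold of 3.
educational = [d for d, s in zip(docs, score_texts(docs))
               if int(round(max(0, min(s, 5)))) >= 3]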
Training
- Model: FacebookAI/xlm-roberta-large with a classification head
- Dataset: 500,000 samples from Llama-3.3 annotations, split into 450,000 train, 25,000 validation, and 25,000 test samples.
- Epochs: 20
- Learning Rate: 3e-4
- Evaluation Metric: F1 score
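Since the model emits a single score, one plausible setup is a regression-style head (num_labels=1) evaluated by rounding predictions back onto the 0-5 scale. The compute_metrics sketch below follows that pattern; the macro averaging and the clamping to 0-5 are assumptions, as the exact training script is not included in this card.

import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (predictions, labels) pair as passed by the transformers Trainer.
    predictions, labels = eval_pred
    preds = np.clip(np.round(predictions.squeeze(-1)), 0, 5).astype(int)
    refs = np.round(labels).astype(int)
    return {"f1": f1_score(refs, preds, average="macro")}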
Test Metrics
              precision    recall  f1-score   support

           0       0.85      0.74      0.80     12079
           1       0.58      0.72      0.64      8509
           2       0.46      0.50      0.48      2811
           3       0.37      0.29      0.32      1004
           4       0.70      0.17      0.27       577
           5       0.10      0.05      0.07        20

    accuracy                           0.68     25000
   macro avg       0.51      0.41      0.43     25000
weighted avg       0.69      0.68      0.68     25000
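The report above has the shape of scikit-learn's classification_report computed on rounded, clamped scores against the Llama annotations. A sketch of producing such a report is below; the two arrays are placeholders for illustration, not the actual test split.

import numpy as np
from sklearn.metrics import classification_report

y_true = np.array([0, 1, 2, 3, 4, 5])               # placeholder gold annotations
y_score = np.array([0.2, 1.1, 2.4, 2.8, 3.9, 4.6])  # placeholder model scores
y_pred = np.clip(np.round(y_score), 0, 5).astype(int)
print(classification_report(y_true, y_pred, digits=2))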
Citing
Preprint coming soon. If you need to cite this work, please use the citation below:
@misc{llama_hplt_edu_classifiers_2025,
  author = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
  title = {Llama-HPLT-edu classifiers},
  year = {2025},
  url = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
  publisher = {Hugging Face}
}