---
language:
- dan
license: apache-2.0
datasets:
- LumiOpen/hpltv2-llama33-edu-annotation
---

# Llama-HPLT-edu-Danish classifier

## Model summary

This is a classifier for judging the educational content of Danish (dan-Latn) web pages. It was developed to filter educational content from [HPLT v2](https://hplt-project.org/datasets/v2.0) and was trained on 450k [annotations](https://huggingface.co/datasets/LumiOpen/hpltv2-llama33-edu-annotation) generated by [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).

The web pages were sampled randomly from the Danish subset of the corpus.

### How to load in Transformers

To load the Llama-HPLT-Edu classifier, use the following code:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "I'm non-educational web page containing nothing useful"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)

# One logit per input; it is used directly as the educational score.
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()

result = {
    "text": text,
    "score": score,
    # Clamp to the 0-5 scale, then round to the nearest integer label.
    "int_score": int(round(max(0, min(score, 5)))),
}
print(result)

# Example results from a model trained with Welsh annotations:
# {'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
# {'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
```
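
The clamp-and-round step from the snippet above can be factored into a small helper when scoring many pages. This is a minimal sketch: the function names `to_int_score` and `keep_page` and the threshold of 3 are illustrative assumptions, not part of the released classifier or of the HPLT filtering pipeline.

```python
def to_int_score(score: float, low: int = 0, high: int = 5) -> int:
    """Clamp a raw classifier score to [low, high] and round it to an integer."""
    # Python's round() uses round-half-to-even, so 2.5 maps to 2 and 3.5 to 4.
    return int(round(max(low, min(score, high))))


def keep_page(score: float, threshold: int = 3) -> bool:
    """Hypothetical filter: keep a page whose integer score reaches the threshold."""
    return to_int_score(score) >= threshold


# Raw scores taken from the example outputs above, plus two out-of-range values.
raw_scores = [0.8145455718040466, 1.6858888864517212, -0.3, 7.2]
print([to_int_score(s) for s in raw_scores])
print([keep_page(s) for s in raw_scores])
```

The threshold is a free parameter; a suitable value depends on how much data can be discarded and on the per-class precision and recall of the classifier.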

## Training

- Model: FacebookAI/xlm-roberta-large with a classification head
- Dataset: 500,000 samples from Llama 3.3 annotations, split into 450,000 train, 25,000 validation, and 25,000 test samples
- Epochs: 20
- Learning Rate: 3e-4
- Evaluation Metric: F1 score

### Test Metrics

```
              precision    recall  f1-score   support

           0       0.85      0.74      0.80     12079
           1       0.58      0.72      0.64      8509
           2       0.46      0.50      0.48      2811
           3       0.37      0.29      0.32      1004
           4       0.70      0.17      0.27       577
           5       0.10      0.05      0.07        20

    accuracy                           0.68     25000
   macro avg       0.51      0.41      0.43     25000
weighted avg       0.69      0.68      0.68     25000
```
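
The averaged rows of the report can be recomputed from the per-class rows. The sketch below assumes the table follows the usual classification-report conventions (macro average as the unweighted mean over classes, weighted average as the support-weighted mean). The per-class values are copied from the table, so they are already rounded to two decimals and the results agree with the report only up to that rounding.

```python
# Per-class (precision, recall, f1, support), copied from the report above.
rows = {
    0: (0.85, 0.74, 0.80, 12079),
    1: (0.58, 0.72, 0.64, 8509),
    2: (0.46, 0.50, 0.48, 2811),
    3: (0.37, 0.29, 0.32, 1004),
    4: (0.70, 0.17, 0.27, 577),
    5: (0.10, 0.05, 0.07, 20),
}

# Total number of test samples across all classes.
total_support = sum(support for _, _, _, support in rows.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for _, _, f1, _ in rows.values()) / len(rows)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for _, _, f1, support in rows.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
# prints: 25000 0.43 0.68
```

The gap between the macro and weighted F1 reflects the class imbalance: classes 0 and 1 make up most of the test set, while the rare classes 3 to 5 have much lower F1.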

## Citing

Preprint coming soon. If you need to cite this work, please use the citation below:

```
@misc{llama_hplt_edu_classifiers_2025,
  author    = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
  title     = {Llama-HPLT-edu classifiers},
  year      = 2025,
  url       = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
  publisher = {Hugging Face}
}
```