---
language:
- dan
license: apache-2.0
datasets:
- LumiOpen/hpltv2-llama33-edu-annotation
---

# Llama-HPLT-edu-Danish classifier

## Model summary
This is a classifier for judging the educational content of Danish (dan-Latn) web pages. It was developed to filter educational content from [HPLT v2](https://hplt-project.org/datasets/v2.0) and was trained on 450k [annotations](https://huggingface.co/datasets/LumiOpen/hpltv2-llama33-edu-annotation) generated by [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).
The web pages were sampled randomly from the Danish subset of the corpus.
### How to load in transformers
To load the Llama-HPLT-Edu classifier, use the following code:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")
model = AutoModelForSequenceClassification.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")
text = "I'm non-educational web page containing nothing useful"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)
# The model outputs a single logit, which is used directly as the educational score.
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()
result = {
    "text": text,
    "score": score,
    # Clamp to the 0-5 annotation scale and round to the nearest integer label.
    "int_score": int(round(max(0, min(score, 5)))),
}
print(result)
# Example output from a model trained with Welsh annotations:
# {'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
# {'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
```
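
For filtering a corpus rather than scoring a single page, the same logic can be applied in batches and thresholded on the integer score. The sketch below is illustrative only: the `score_batch` helper, the batch size, and the keep threshold of 3 are assumptions, not the settings used to filter HPLT v2.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def score_batch(texts, batch_size=16):
    """Return the raw educational score for each text (hypothetical helper)."""
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding="longest", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.tolist())
    return scores

docs = ["...", "..."]  # Danish web pages to score
# Keep documents whose rounded score reaches the illustrative threshold of 3.
kept = [d for d, s in zip(docs, score_batch(docs)) if round(max(0.0, min(s, 5.0))) >= 3]
```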

## Training
- Model: FacebookAI/xlm-roberta-large with a classification head
- Dataset: 500,000 samples annotated by Llama 3.3, split into 450,000 train, 25,000 validation, and 25,000 test examples.
- Epochs: 20
- Learning Rate: 3e-4
- Evaluation Metric: F1 score
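
The configuration above maps onto a standard `transformers` fine-tune. The sketch below is a hedged reconstruction, not the actual training script: it assumes a single-logit regression head (consistent with the inference example above), that the annotation dataset exposes `text` and `label` columns with the usual split names, and that rounded, clamped predictions are compared against the labels for the F1 metric.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "FacebookAI/xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(base)
# A single-logit head trained as a regression on the 0-5 scores (assumption).
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1, problem_type="regression"
)

# Column names ('text', 'label') and split names are assumptions about the dataset layout.
ds = load_dataset("LumiOpen/hpltv2-llama33-edu-annotation")

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True)
    enc["labels"] = [float(score) for score in batch["label"]]
    return enc

ds = ds.map(preprocess, batched=True)

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # Round and clamp the regression output to the 0-5 scale before scoring.
    preds = np.clip(np.round(preds.squeeze(-1)), 0, 5).astype(int)
    return {"f1": f1_score(labels.astype(int), preds, average="macro")}

args = TrainingArguments(
    output_dir="llama-hplt-edu-dan",
    learning_rate=3e-4,
    num_train_epochs=20,
    eval_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="f1",
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()
```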

### Test Metrics
```

              precision    recall  f1-score   support

           0       0.85      0.74      0.80       12079
           1       0.58      0.72      0.64       8509
           2       0.46      0.50      0.48       2811
           3       0.37      0.29      0.32       1004
           4       0.70      0.17      0.27       577
           5       0.10      0.05      0.07       20

    accuracy                           0.68      25000
   macro avg       0.51      0.41      0.43      25000
weighted avg       0.69      0.68      0.68      25000

```

## Citing
Preprint coming soon. If you need to cite this work, please use the citation below: 
```
@misc{llama_hplt_edu_classifiers_2025,
    author       = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
    title        = {Llama-HPLT-edu classifiers},
    year         = 2025,
    url          = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
    publisher    = {Hugging Face}
}
```