	---
	language:
	- dan
	license: apache-2.0
	datasets:
	- LumiOpen/hpltv2-llama33-edu-annotation
	---

	# Llama-HPLT-edu-Danish classifier

	## Model summary
	This is a classifier for judging the educational content of Danish (dan-Latn) web pages. It was developed to filter educational content from [HPLT v2](https://hplt-project.org/datasets/v2.0) and was trained on 450k [annotations](https://huggingface.co/datasets/LumiOpen/hpltv2-llama33-edu-annotation) generated by [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).
	The web pages were sampled randomly from the Danish subset of the corpus.

	### How to load in transformers
	To load the Llama-HPLT-Edu classifier, use the following code:

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	tokenizer = AutoTokenizer.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")
	model = AutoModelForSequenceClassification.from_pretrained("LumiOpen/llama-hpltv2-edu-classifier-xlm-roberta-large-dan-Latn")
	text = "I'm non-educational web page containing nothing useful"
	inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
	outputs = model(**inputs)
	# Drop the trailing dimension so the single input yields one scalar score
	logits = outputs.logits.squeeze(-1).float().detach().numpy()
	score = logits.item()
	result = {
	    "text": text,
	    "score": score,
	    # Clamp the raw score to the 0-5 range and round to an integer label
	    "int_score": int(round(max(0, min(score, 5)))),
	}
	print(result)
	# Example results from a model trained with Welsh annotations:
	# {'text': "I'm non-educational web page containing nothing useful", 'score': 0.8145455718040466, 'int_score': 1}
	# {'text': 'what are most common animals found in farm? there are cows, sheeps', 'score': 1.6858888864517212, 'int_score': 2}
	```
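
Since the classifier was developed for filtering, the scores it produces would typically be thresholded downstream. The following is a minimal sketch of that post-processing step, not the authors' pipeline: the helper names `to_int_score` and `keep_educational`, the sample records, and the cutoff `min_int_score=3` are all illustrative assumptions. Only the clamp-and-round rule is taken from the loading example above.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw score to [0, 5] and round it, as in the loading example."""
    # Python's round() sends exact halves to the nearest even integer,
    # so round(2.5) == 2 while round(3.5) == 4.
    return int(round(max(0.0, min(score, 5.0))))


def keep_educational(records, min_int_score=3):
    """Keep records whose integer score reaches min_int_score.

    The default cutoff of 3 is a hypothetical choice for illustration,
    not a value taken from the model card.
    """
    return [r for r in records if to_int_score(r["score"]) >= min_int_score]


# Hypothetical scored documents; in practice "score" comes from the model.
scored = [
    {"text": "doc a", "score": 0.81},
    {"text": "doc b", "score": 3.4},
    {"text": "doc c", "score": 6.2},  # out-of-range scores are clamped to 5
]
kept = keep_educational(scored)
print([r["text"] for r in kept])
```

A suitable cutoff depends on how much data can be discarded; the per-class test metrics in this card give a sense of how reliable each score level is.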

	## Training
	- Model: FacebookAI/xlm-roberta-large with a classification head
	- Dataset: 500,000 samples from the Llama 3.3 annotations, split into 450,000 train, 25,000 validation, and 25,000 test samples.
	- Epochs: 20
	- Learning Rate: 3e-4
	- Evaluation Metric: F1 score

	### Test Metrics
	```

	              precision    recall  f1-score   support

	           0       0.85      0.74      0.80     12079
	           1       0.58      0.72      0.64      8509
	           2       0.46      0.50      0.48      2811
	           3       0.37      0.29      0.32      1004
	           4       0.70      0.17      0.27       577
	           5       0.10      0.05      0.07        20

	    accuracy                           0.68     25000
	   macro avg       0.51      0.41      0.43     25000
	weighted avg       0.69      0.68      0.68     25000

	```
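
To make the gap between the two averaged rows concrete, the short sketch below recomputes the macro and weighted F1 from the per-class rows of the report. It uses the rounded values as printed, so it is only an approximation of the original computation; the variable names are illustrative.

```python
# Per-class F1 and support, copied from the test report above.
f1 = {0: 0.80, 1: 0.64, 2: 0.48, 3: 0.32, 4: 0.27, 5: 0.07}
support = {0: 12079, 1: 8509, 2: 2811, 3: 1004, 4: 577, 5: 20}

total = sum(support.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(total, round(macro_f1, 2), round(weighted_f1, 2))
# 25000 0.43 0.68
```

The macro average is pulled down by the rare classes 3 to 5, which together account for 1,601 of the 25,000 test samples, while the weighted average is dominated by classes 0 and 1.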

	## Citing
	Preprint coming soon. If you need to cite this work, please use the citation below:
	```
	@misc{llama_hplt_edu_classifiers_2025,
	  author = {Tarkka, Otto and Reunamo, Akseli and Vitiugin, Fedor and Pyysalo, Sampo},
	  title = {Llama-HPLT-edu classifiers},
	  year = 2025,
	  url = {https://huggingface.co/collections/LumiOpen/hplt-edu-classifiers-68a85a78f9710426320e7cbb},
	  publisher = {Hugging Face}
	}
	```