Model Card for llama-estllm-protype-0825

llama-estllm-protype-0825 is the first artifact produced by the EstLLM project. This release is intended for evaluation in a conversational, ChatbotArena-style setting on baromeeter.ai, thereby establishing a baseline for future improvements.

The model was produced by continued pre-training of Llama-3.1-8B on approximately 35B tokens, followed by supervised fine-tuning (SFT) and direct preference optimization (DPO).

Model Details

Model Description

  • Developed by: TartuNLP and TalTechNLP research groups
  • Funded by: Estonian Ministry of Education and Research, "Estonian Language Technology Program 2018-2027"
  • Model type: Causal Language Model, Instruction-following
  • Language(s) (NLP): Estonian, English
  • License: Llama 3.1 Community License Agreement
  • Finetuned from model: meta-llama/Llama-3.1-8B

Evaluation

Instruction-following

Every benchmark in this category is treated as a generative task: evaluation is performed on model responses obtained with greedy decoding (temperature 0), rather than on logits.

| Model (# parameters ↓) | IFEval-et* | Winogrande-et** | Trivia-et*** | Grammar-et**** |
|---|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | 0.7891 | 0.8138 | 0.4225 | 0.916 |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 | 0.8042 | 0.27 | 0.364 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 | 0.7878 | 0.4713 | 0.818 |
| meta-llama/Llama-3.3-70B-Instruct | 0.7705 | 0.7397 | 0.3875 | 0.797 |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 | 0.7227 | 0.315 | 0.694 |
| google/gemma-3-27b-it | 0.7655 | 0.7510 | 0.325 | 0.817 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 | 0.5846 | 0.3738 | 0.764 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 | 0.5399 | 0.2888 | 0.657 |
| tartuNLP/llama-estlm-prototype-0825 | 0.5174 | 0.5812 | 0.425 | 0.692 |
| BSC-LT/salamandra-7b-instruct | 0.5195 | 0.2878 | 0.2875 | 0.594 |
| tartuNLP/Llammas | 0.3524 | 0.5037 | 0.2838 | 0.529 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 | 0.5473 | 0.2938 | 0.598 |

* inst_level_strict_acc

** 3-shot, accuracy

*** 0-shot, accuracy

**** 0-shot, accuracy, formatted as multiple-choice
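The zero-temperature protocol above amounts to greedy decoding: at every step the highest-scoring token is chosen, so responses are deterministic and comparable across runs. A minimal sketch with a toy scoring function (hypothetical, for illustration only; the actual evaluation presumably queries the model's logits):

```python
def greedy_decode(score_fn, prompt, max_new_tokens=5, eos="<eos>"):
    """Greedy (temperature-0) decoding: always pick the argmax next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = score_fn(tokens)          # maps candidate token -> score
        next_token = max(scores, key=scores.get)  # deterministic argmax
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens

# Toy scorer (hypothetical): answers "jah" once, then emits end-of-sequence.
def toy_scores(tokens):
    if tokens[-1] != "jah":
        return {"jah": 1.0, "ei": 0.5, "<eos>": 0.1}
    return {"<eos>": 1.0, "jah": 0.2}

print(greedy_decode(toy_scores, ["Kas", "sobib", "?"]))
# → ['Kas', 'sobib', '?', 'jah']
```

Because the argmax is deterministic, repeated runs produce identical responses, which is why the benchmark scores above are single-run numbers.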

Translation

English to Estonian

| Model | wmt24pp (BLEU ↑) |
|---|---|
| tartuNLP/llama-estlm-prototype-0825 | 0.264 |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |
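For reference, BLEU scores like those above are normally computed with a library such as sacreBLEU; the core of the metric is clipped n-gram precision combined with a brevity penalty. A simplified single-reference sketch on the [0, 1] scale used in the table (an assumption for illustration, not the exact evaluation code):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU in [0, 1]: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty, single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean

print(bleu("see on hea tõlge", "see on hea tõlge"))  # → 1.0
print(bleu("täiesti vale", "see on hea tõlge"))      # → 0.0
```

Production evaluations use corpus-level BLEU with standardized tokenization (as in sacreBLEU), so the sketch above will not reproduce the table's numbers exactly.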

Limitations

This is an early prototype. Accordingly, it has limitations beyond those of the base Llama model:

  • Relatively short context window of 4096 tokens; the model is not expected to perform well beyond that.
  • Multi-turn conversations are not supported in this version.
  • Trained with the original Llama 3.1 system prompt, which has a hard-coded knowledge cut-off date.

Citation

TBA

Safetensors: 8.03B params, BF16
