# domestic-yak, a Macedonian LM (base version)
## Model Summary
This model is a Macedonian language adaptation of the Llama 3.1 8B model. It has undergone continued pretraining on a deduplicated version of the Macedonian Corpus Raw dataset, containing approximately 1.6 billion tokens. The model has been pretrained for one epoch on this corpus, making it well-suited for tasks involving the Macedonian language, such as text classification, language generation, and translation.
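For quick reference, here is a minimal usage sketch for loading the model and generating Macedonian text with Hugging Face Transformers (the repo ID is taken from this card; the dtype, device settings, and prompt are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # an 8B model needs roughly 16 GB of GPU memory in bf16
    device_map="auto",
)

# This is a base model, so prompt it for plain text continuation rather
# than chat. Prompt: "Skopje is the capital of"
prompt = "Скопје е главниот град на"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```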
## Results
The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1-8B Instruct, evaluated using the macedonian-llm-eval benchmark.
As shown in the table, domestic-yak-8B consistently outperforms its base model on all tasks.
| Task (mk version) | domestic-yak-8B | Llama 3.1-8B Instruct |
|---|---|---|
| ARC Easy | 0.5244 ± 0.0102 | 0.4453 ± 0.0102 |
| ARC Challenge | 0.3183 ± 0.0136 | 0.2824 ± 0.0132 |
| BoolQ | 0.7676 ± 0.0074 | 0.7639 ± 0.0074 |
| HellaSwag | 0.4324 ± 0.0049 | 0.3740 ± 0.0048 |
| OpenBookQA | 0.2920 ± 0.0204 | 0.2520 ± 0.0194 |
| PIQA | 0.6687 ± 0.0110 | 0.5865 ± 0.0115 |
| NQ Open | 0.0416 ± 0.0033 | 0.0335 ± 0.0030 |
| WinoGrande | 0.6259 ± 0.0136 | 0.5683 ± 0.0139 |
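The ± values are standard errors of the accuracy estimates. For an accuracy p measured over n benchmark items, a binomial standard error reproduces numbers of this magnitude; the sketch below illustrates the calculation (the item count is an assumption for illustration, not the actual benchmark size):

```python
import math

def binomial_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy p estimated from n independent items."""
    return math.sqrt(p * (1.0 - p) / n)

# Illustrative: with roughly 2,400 items, an accuracy of 0.5244 gives a
# standard error close to the reported ±0.0102 for ARC Easy.
print(round(binomial_stderr(0.5244, 2400), 4))  # 0.0102
```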
Quantitative evaluation is just one part of the story. We also conducted qualitative tests, interacting with the model to assess its ability to generate coherent and relevant responses in Macedonian. The results were satisfying: the model not only mastered the language's grammar but also demonstrated a natural and fluent style.
Here's a comparison of the answers from Llama-3.1-8B-Instruct and domestic-yak-8B (both translated from Macedonian; the grammatical errors in the Llama output mirror the original):

Llama-3.1-8B-Instruct:
"Hello! Every time when you use this platform, with you you are in contact with my system of artificial intelligence. My task is to provide to you an informative and useful answer to every question that you can ask me. I am called 'Assistant', and I am here to make your life easier and the interesting. What your question is?"

domestic-yak-8B:
"Hello, I am an assistant designed to provide information and answers to users' questions. I can answer a wide range of questions, from general facts and information to specialized questions in various fields. My goal is to provide accurate and useful answers, as well as to make the search process easier and save users time. Feel free to ask me your question, and I will try to help you."
## Key Details
- Language: Macedonian (`mk`)
- Base Model: Meta Llama 3.1 8B Instruct
- Dataset: LVSTCK/macedonian-corpus-raw-dedup (deduplicated version)
- Training Tokens: ~1.6 billion
- Pretraining Epochs: 1 epoch
- Pretraining Objective: Causal language modeling (continued pretraining that updates all model weights; a minimal sketch follows this list)
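The objective above is standard next-token prediction with all weights unfrozen. As a rough illustration (not the authors' actual training script: the hyperparameters, sequence length, and the assumed `text` column are placeholders), continued pretraining with Hugging Face Transformers could look like this:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # base model per Key Details above
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_id)  # all weights trainable

# Assumes the corpus exposes a "text" column.
raw = load_dataset("LVSTCK/macedonian-corpus-raw-dedup", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domestic-yak-cpt",
        num_train_epochs=1,             # one pass over the ~1.6B-token corpus
        per_device_train_batch_size=1,  # placeholder; tune to your hardware
        gradient_accumulation_steps=64,
        learning_rate=1e-5,             # placeholder
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the causal LM objective: labels are the inputs shifted by one.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```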
## ⚠️ Limitations
- Biases: The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications.
- Domain Specificity: While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented.
- Chat Capabilities: This is the base model, so its chat capabilities may be limited. For conversational use, use the instruct version.
## 📬 Contact
For inquiries, feedback, or contributions, please feel free to reach out to the core team.
## Citation

```bibtex
@article{krsteski2025towards,
  title={Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language},
  author={Krsteski, Stefan and Tashkovska, Matea and Sazdov, Borjan and Gjoreski, Hristijan and Gerazov, Branislav},
  journal={arXiv preprint arXiv:2506.09560},
  year={2025}
}
```
## Paper

The model was presented in the paper *Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language* (arXiv:2506.09560).