πŸ‚ domestic-yak, a Macedonian LM (base version)

Model Summary

This model is a Macedonian language adaptation of the Llama 3.1 8B model. It underwent continued pretraining for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset, containing approximately 1.6 billion tokens, making it well-suited for Macedonian-language tasks such as text classification, language generation, and translation.
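
For quick experimentation, here is a minimal loading-and-generation sketch using the πŸ€— Transformers library. The model id `LVSTCK/domestic-yak-8B` and the BF16 weights are taken from this card; the prompt and generation settings are purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the released weights are BF16
    device_map="auto",           # requires the `accelerate` package
)

# Base model: prompt with plain-text completion, not a chat template.
prompt = "БкопјС Π΅ Π³Π»Π°Π²Π΅Π½ Π³Ρ€Π°Π΄ Π½Π°"  # "Skopje is the capital of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```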

πŸ“Š Results

The table below compares the performance of our model, domestic-yak-8B, with its foundation model, Llama 3.1-8B Instruct, evaluated using the macedonian-llm-eval benchmark.

As shown in the table, domestic-yak-8B outperforms its foundation model on all tasks.

| Task (mk-version) | domestic-yak-8B | Llama 3.1-8B Instruct |
|---|---|---|
| ARC Easy | 0.5244 Β± 0.0102 | 0.4453 Β± 0.0102 |
| ARC Challenge | 0.3183 Β± 0.0136 | 0.2824 Β± 0.0132 |
| BoolQ | 0.7676 Β± 0.0074 | 0.7639 Β± 0.0074 |
| HellaSwag | 0.4324 Β± 0.0049 | 0.3740 Β± 0.0048 |
| OpenBookQA | 0.2920 Β± 0.0204 | 0.2520 Β± 0.0194 |
| PIQA | 0.6687 Β± 0.0110 | 0.5865 Β± 0.0115 |
| NQ Open | 0.0416 Β± 0.0033 | 0.0335 Β± 0.0030 |
| WinoGrande | 0.6259 Β± 0.0136 | 0.5683 Β± 0.0139 |
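
For a quick sense of the margins, the per-task gains implied by the table can be computed directly; a minimal sketch, with the mean scores copied from the table above:

```python
# Per-task accuracy gains of domestic-yak-8B over Llama 3.1-8B Instruct,
# using the mean scores from the table above.
scores = {
    # task: (domestic-yak-8B, Llama 3.1-8B Instruct)
    "ARC Easy":      (0.5244, 0.4453),
    "ARC Challenge": (0.3183, 0.2824),
    "BoolQ":         (0.7676, 0.7639),
    "HellaSwag":     (0.4324, 0.3740),
    "OpenBookQA":    (0.2920, 0.2520),
    "PIQA":          (0.6687, 0.5865),
    "NQ Open":       (0.0416, 0.0335),
    "WinoGrande":    (0.6259, 0.5683),
}

for task, (yak, llama) in scores.items():
    gain = yak - llama
    print(f"{task:13}  +{gain:.4f} absolute  (+{gain / llama:.1%} relative)")
```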

Quantitative evaluation is only part of the story. We also conducted qualitative tests, interacting with the model to assess its ability to generate coherent and relevant responses in Macedonian. The results were encouraging: the model not only handled the language's grammar correctly but also wrote in a natural, fluent style.

Here’s a comparison showing the answers from Llama-3.1-8B-Instruct and domestic-yak-8B:

  • Llama-3.1-8B-Instruct:
    "Π—Π΄Ρ€Π°Π²ΠΎ! БСкој ΠΏΠ°Ρ‚ ΠΊΠΎΠ³Π° Π³ΠΎ ΠΊΠΎΡ€ΠΈΡΡ‚ΠΈΡˆ ΠΎΠ²Π°Π° ΠΏΠ»Π°Ρ‚Ρ„ΠΎΡ€ΠΌΠ°, со Ρ‚ΠΎΠ±Π΅ си Π²ΠΎ ΠΊΠΎΠ½Ρ‚Π°ΠΊΡ‚ со ΠΌΠΎΡ˜ΠΎΡ‚ систСм Π½Π° Π²Π΅ΡˆΡ‚Π°Ρ‡ΠΊΠ° ΠΈΠ½Ρ‚Π΅Π»ΠΈΠ³Π΅Π½Ρ†ΠΈΡ˜Π°. ΠœΠΎΡ˜ΠΎΡ‚ Π·Π°Π΄Π°Ρ‡Π° Π΅ Π΄Π° Ρ‚ΠΈ Π³ΠΎ ΠΎΠ±Π΅Π·Π±Π΅Π΄Π°ΠΌ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΠ²Π΅Π½ ΠΈ корисСн ΠΎΠ΄Π³ΠΎΠ²ΠΎΡ€ Π½Π° сСкоС ΠΏΡ€Π°ΡˆΠ°ΡšΠ΅ ΡˆΡ‚ΠΎ моТСш Π΄Π° ΠΌΠΈ Π³ΠΎ ΠΏΠΎΡΡ‚Π°Π²ΠΈΡˆ. Π‘Π΅ Π½Π°Ρ€Π΅ΠΊΡƒΠ²Π°ΠΌ 'АсистСнт', ΠΈ сум Ρ‚ΡƒΠΊΠ° Π·Π° Π΄Π° Ρ‚ΠΈ Π³ΠΎ ΠΏΡ€Π°Π²Π°ΠΌ ΠΆΠΈΠ²ΠΎΡ‚ полСсСн ΠΈ интСрСсниот. Π¨Ρ‚ΠΎ Ρ‚Π²ΠΎΠ΅Ρ‚ΠΎ ΠΏΡ€Π°ΡˆΠ°ΡšΠ΅ Π΅?"

  • domestic-yak-8B:
    "Π—Π΄Ρ€Π°Π²ΠΎ, јас сум асистСнт Π΄ΠΈΠ·Π°Ρ˜Π½ΠΈΡ€Π°Π½ Π΄Π° ΠΎΠ±Π΅Π·Π±Π΅Π΄ΠΈ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ†ΠΈΠΈ ΠΈ ΠΎΠ΄Π³ΠΎΠ²ΠΎΡ€ΠΈ Π½Π° ΠΏΡ€Π°ΡˆΠ°ΡšΠ° ΠΎΠ΄ корисницитС. МоТам Π΄Π° ΠΎΠ΄Π³ΠΎΠ²Π°Ρ€Π°ΠΌ Π½Π° ΡˆΠΈΡ€ΠΎΠΊ спСктар Π½Π° ΠΏΡ€Π°ΡˆΠ°ΡšΠ°, ΠΎΠ΄ ΠΎΠΏΡˆΡ‚ΠΈ Ρ„Π°ΠΊΡ‚ΠΈ ΠΈ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ†ΠΈΠΈ Π΄ΠΎ ΡΠΏΠ΅Ρ†ΠΈΡ˜Π°Π»ΠΈΠ·ΠΈΡ€Π°Π½ΠΈ ΠΏΡ€Π°ΡˆΠ°ΡšΠ° Π²ΠΎ Ρ€Π°Π·Π»ΠΈΡ‡Π½ΠΈ области. ΠœΠΎΡ˜Π°Ρ‚Π° Ρ†Π΅Π» Π΅ Π΄Π° ΠΎΠ±Π΅Π·Π±Π΅Π΄Π°ΠΌ Ρ‚ΠΎΡ‡Π½ΠΈ ΠΈ корисни ΠΎΠ΄Π³ΠΎΠ²ΠΎΡ€ΠΈ, ΠΊΠ°ΠΊΠΎ ΠΈ Π΄Π° Π³ΠΎ олСснам процСсот Π½Π° ΠΏΡ€Π΅Π±Π°Ρ€ΡƒΠ²Π°ΡšΠ΅ ΠΈ Π΄Π° Π·Π°ΡˆΡ‚Π΅Π΄Π°ΠΌ Π²Ρ€Π΅ΠΌΠ΅ Π·Π° корисницитС. Π‘Π»ΠΎΠ±ΠΎΠ΄Π½ΠΎ постави ΠΌΠΈ Π³ΠΎ Ρ‚Π²ΠΎΠ΅Ρ‚ΠΎ ΠΏΡ€Π°ΡˆΠ°ΡšΠ΅ ΠΈ ќС сС ΠΎΠ±ΠΈΠ΄Π°ΠΌ Π΄Π° Ρ‚ΠΈ ΠΏΠΎΠΌΠΎΠ³Π½Π°ΠΌ."

πŸ”‘ Key Details

  • Base model: Llama 3.1 8B
  • Parameters: 8.03B (BF16, safetensors)
  • Training data: deduplicated Macedonian Corpus Raw, ~1.6 billion tokens
  • Training regime: continued pretraining, one epoch

⚠️ Limitations

  • Biases: The model may reflect biases present in the training data. The corpus was cleaned and deduplicated, but further bias mitigation may be necessary for sensitive applications.
  • Domain Specificity: Although the dataset spans diverse domains, performance may vary on niche or underrepresented topics. For example, the corpus is heavily skewed toward news-style texts, while domains such as science and medicine are less represented.
  • Chat Capabilities: This is the base model, so its chat capabilities are limited. For conversational use, prefer the instruct version; with the base model, chat-like behavior can be approximated with few-shot completion prompts, as sketched below.
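
Since the base model has not been instruction-tuned, chat-like behavior is best approximated with few-shot completion prompts. Below is a hedged sketch that reuses the `model` and `tokenizer` from the loading example in the Model Summary; the "ΠŸΡ€Π°ΡˆΠ°ΡšΠ΅:"/"Одговор:" ("Question:"/"Answer:") format is illustrative, not a template prescribed by the authors.

```python
# Illustrative few-shot Q&A prompt for the base model; reuses `model` and
# `tokenizer` from the loading sketch in the Model Summary section.
few_shot = (
    "ΠŸΡ€Π°ΡˆΠ°ΡšΠ΅: Која Π΅ Π½Π°Ρ˜Π²ΠΈΡΠΎΠΊΠ°Ρ‚Π° ΠΏΠ»Π°Π½ΠΈΠ½Π° Π²ΠΎ МакСдонија?\n"
    "Одговор: Голем ΠšΠΎΡ€Π°Π±.\n\n"  # "Question: What is the highest mountain in Macedonia? Answer: Mount Korab."
    "ΠŸΡ€Π°ΡˆΠ°ΡšΠ΅: Која Π΅ Π½Π°Ρ˜Π΄ΠΎΠ»Π³Π°Ρ‚Π° Ρ€Π΅ΠΊΠ° Π²ΠΎ МакСдонија?\n"  # "What is the longest river in Macedonia?"
    "Одговор:"
)
inputs = tokenizer(few_shot, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
# Print only the newly generated answer tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

In practice you may want to truncate the continuation at the first newline, since base models tend to keep generating further question/answer pairs.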

πŸ“¬ Contact

For inquiries, feedback, or contributions, please feel free to reach out to the core team.

Citation

```bibtex
@article{krsteski2025towards,
  title={Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language},
  author={Krsteski, Stefan and Tashkovska, Matea and Sazdov, Borjan and Gjoreski, Hristijan and Gerazov, Branislav},
  journal={arXiv preprint arXiv:2506.09560},
  year={2025}
}
```

Paper

The model was presented in the paper [Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language](https://arxiv.org/abs/2506.09560).
