Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Abstract
Hala, a family of Arabic-centric instruction and translation models, achieves state-of-the-art results using a translate-and-tune pipeline, slerp merging, and fine-tuning on high-quality bilingual supervision.
We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
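The slerp merging step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `slerp`, the flat-list weight representation, and the interpolation factor `t` are assumptions made for illustration, and the abstract does not state which tooling or value of `t` was used. The sketch interpolates along the arc between a base weight vector and a fine-tuned one, falling back to linear interpolation when the two are nearly parallel.

```python
import math


def slerp(w_base, w_tuned, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    t = 0 returns w_base, t = 1 returns w_tuned (hypothetical convention).
    """
    def lerp():
        # Plain linear interpolation, used when slerp is ill-defined.
        return [(1 - t) * a + t * b for a, b in zip(w_base, w_tuned)]

    norm_a = math.sqrt(sum(a * a for a in w_base))
    norm_b = math.sqrt(sum(b * b for b in w_tuned))
    if norm_a < eps or norm_b < eps:
        return lerp()

    # Angle between the two vectors, measured on their normalized directions.
    dot = sum(a * b for a, b in zip(w_base, w_tuned))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Vectors are (anti)parallel: the arc degenerates.
        return lerp()

    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_base, w_tuned)]


# Usage: merge two toy "weight vectors" halfway between base and tuned.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
```

In a real merge, a function like this would be applied tensor by tensor to the base and fine-tuned checkpoints, with `t` controlling how much Arabic specialization is retained relative to base-model behavior.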
Community
A series of state-of-the-art nano and small scale Arabic language models.