Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce Babel, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel's performance ceiling. We introduce two variants: Babel-9B, designed for efficient inference and fine-tuning, and Babel-83B, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level as commercial models.
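The abstract contrasts Babel's layer extension with plain continued pretraining, but the exact recipe (which layers are duplicated and how the new layers are initialized) is not spelled out on this page. The snippet below is only a minimal PyTorch sketch of a generic depth up-scaling idea under that assumption: duplicate existing decoder layers to grow the parameter count while keeping the extended stack close to the original before further training. `ToyBlock`, `extend_layers`, and the `insert_every` schedule are illustrative names, not code from the Babel project.

```python
# Hedged sketch of depth up-scaling via layer duplication (assumed recipe;
# the paper's exact layer placement and initialization may differ).
import copy

import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a transformer decoder layer (illustrative only)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection, as in a real decoder layer.
        return x + torch.relu(self.ff(x))


def extend_layers(layers: nn.ModuleList, insert_every: int = 4) -> nn.ModuleList:
    """Grow depth by inserting a copy after every `insert_every`-th layer.

    The inserted layers start as exact clones of their neighbors, so the
    extended stack behaves similarly to the original until it is further
    trained (e.g., by continued multilingual pretraining).
    """
    extended = []
    for i, layer in enumerate(layers):
        extended.append(layer)
        if (i + 1) % insert_every == 0:
            extended.append(copy.deepcopy(layer))  # duplicated layer
    return nn.ModuleList(extended)


# Toy usage: a 16-layer stack grows to 20 layers (16 originals + 4 copies).
base = nn.ModuleList(ToyBlock(64) for _ in range(16))
grown = extend_layers(base, insert_every=4)
print(len(base), "->", len(grown))  # 16 -> 20
```

Duplicating layers (rather than widening them) keeps the per-token compute pattern familiar and lets the new layers inherit trained weights, which is one common rationale for this style of model extension.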
Community
🚀 Key Highlights:
1️⃣ Covering 90% of the global population: support for the top 25 languages by number of speakers, prioritizing widely spoken but previously underexplored languages in open multilingual models.
2️⃣ Innovative architecture: unlike traditional continued pretraining approaches, Babel expands its parameter count through model extension, raising its performance ceiling.
3️⃣ Two powerful variants (see the usage sketch below):
• Babel-9B: designed for efficient inference and fine-tuning.
• Babel-83B: a new benchmark for open multilingual LLMs.
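For readers who want to try the chat variants, here is a minimal inference sketch using the Hugging Face `transformers` library. The repository id `Tower-Babel/Babel-9B-Chat`, the example prompt, and the generation settings are assumptions for illustration rather than details from the paper; check the model card linked from this page for the exact id and recommended usage.

```python
# Hedged usage sketch for the 9B chat variant; repo id below is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tower-Babel/Babel-9B-Chat"  # assumed repository id, verify on the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Example multilingual prompt (Malay: "Explain photosynthesis in one sentence.")
messages = [{"role": "user", "content": "Terangkan fotosintesis dalam satu ayat."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```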
Great
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation (2025)
- CoCo-CoLa: Evaluating Language Adherence in Multilingual LLMs (2025)
- Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages (2025)
- LLMic: Romanian Foundation Language Model (2025)
- The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models (2025)
- Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding (2025)
- One ruler to measure them all: Benchmarking multilingual long-context language models (2025)