Accepted at ACL 2025 (Main Conference)
TigerLLM - A Family of Bangla Large Language Models
Nishat Raihan, Marcos Zampieri
George Mason University, VA, USA
If you find our work helpful, please consider citing our paper:
@inproceedings{raihan-zampieri-2025-tigerllm,
title = "{T}iger{LLM} - A Family of {B}angla Large Language Models",
author = "Raihan, Nishat and
Zampieri, Marcos",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-short.69/",
doi = "10.18653/v1/2025.acl-short.69",
pages = "887--896",
ISBN = "979-8-89176-252-7"
}
Abstract
The development of Large Language Models (LLMs) remains heavily skewed towards English and a few other high-resource languages. This linguistic disparity is particularly evident for Bangla, the fifth most spoken language. A few initiatives have attempted to create open-source Bangla LLMs, but their performance still lags behind high-resource languages and they offer limited reproducibility. To address this gap, we introduce TigerLLM, a family of Bangla LLMs. Our results demonstrate that these models surpass all open-source alternatives and also outperform larger proprietary models such as GPT-3.5 across standard benchmarks, establishing TigerLLM as the new baseline for future Bangla language modeling.
1. Introduction
LLMs have fundamentally transformed NLP by achieving exceptional performance across a wide range of tasks. However, these advances have predominantly benefited high-resource languages. Despite having roughly 237 million native speakers, Bangla remains underserved in modern NLP owing to a lack of high-quality training data and reproducible methodologies.
1.1 Limitations of Bangla LLM Initiatives
Recent efforts (e.g., titu-Gemma, titu-LLaMA, Bangla-LLaMA, G2B) suffer from low reproducibility, suboptimal performance, and poor documentation. Many rely on machine-translated synthetic datasets, which compromises instruction quality. Table 1 summarizes these initiatives.
Table 1: Existing Bangla LLM initiatives compared with TigerLLM (✓ = available, ✗ = unavailable/unreported).

| Base-LLM | Size | Pretraining Tokens (pt) | pt Corpora | Finetuning Pairs (ft) | ft Dataset | Paper/Report? | Reproducibility? |
|---|---|---|---|---|---|---|---|
| titu-Gemma (Gemma-2) | 2B | 4.4B | ✗ | ✗ | ✗ | ✗ | ✗ |
| titu-LLaMA (LLaMA-3.1) | 3B | 37B | ✗ | ✗ | ✗ | ✗ | ✗ |
| Bangla-LLaMA (LLaMA-3.2) | 3B | ✗ | ✗ | 172K | Orca (translated) | ✗ | ✗ |
| G2B (Gemma-2) | 9B | ✗ | ✗ | 145K | Alpaca (translated) | ✗ | ✗ |
| Bangla-LLaMA (LLaMA-2) | 13B | ✗ | ✗ | 145K | Alpaca (translated) | ✗ | ✗ |
| TigerLLM (LLaMA-3.2) | 1B | 10M | Bangla-TextBook | 100K | Bangla-Instruct | ✓ | ✓ |
| TigerLLM (Gemma-2) | 9B | 10M | Bangla-TextBook | 100K | Bangla-Instruct | ✓ | ✓ |
1.2 Contributions
- Bangla-TextBook Corpus: A 10M-token corpus of high-quality educational texts.
- Bangla-Instruct Dataset: 100K native Bangla instruction-response pairs generated via self-instruct and advanced teacher models.
- TigerLLM Models: A family of models (1B and 9B parameters) that achieve significant performance improvements over existing alternatives.
2. Bangla-TextBook Corpus
The Bangla-TextBook corpus is compiled exclusively from open-source educational materials provided by the National Curriculum and Textbook Board of Bangladesh. It aggregates texts from 163 textbooks for Grades 6–12, yielding 9,897,623 tokens and 697,903 sentences, capturing authentic academic language use.
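For readers who want to verify or extend these statistics, the snippet below is a minimal sketch of how sentence and token counts could be computed from a directory of plain-text files. The file layout, the danda-based sentence splitting, and the whitespace tokenization are illustrative assumptions, not the exact preprocessing used for Bangla-TextBook.

```python
# Hypothetical sketch: counting sentences and tokens in a directory of
# plain-text Bangla textbook files. Layout and tokenization are assumptions.
from pathlib import Path
import re

def corpus_stats(corpus_dir: str) -> tuple[int, int]:
    """Return (num_sentences, num_tokens) for all .txt files under corpus_dir."""
    num_sentences = 0
    num_tokens = 0
    # Bangla sentences typically end with the danda "।" as well as Latin
    # punctuation; a simple regex split is used here for illustration.
    sentence_splitter = re.compile(r"[।?!\.]+")
    for path in Path(corpus_dir).rglob("*.txt"):
        text = path.read_text(encoding="utf-8")
        sentences = [s for s in sentence_splitter.split(text) if s.strip()]
        num_sentences += len(sentences)
        # Whitespace tokenization as a rough proxy; the reported 9.9M-token
        # figure may come from a different tokenizer.
        num_tokens += len(text.split())
    return num_sentences, num_tokens

if __name__ == "__main__":
    sents, toks = corpus_stats("bangla_textbook/")  # placeholder path
    print(f"{sents:,} sentences, {toks:,} tokens")
```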
3. Bangla-Instruct
To overcome these limitations, the Bangla-Instruct dataset contains 100,000 instruction-response pairs generated with a self-instruct framework. Key steps include:
- Seed task generation: 500 tasks curated by 50 volunteers from diverse academic backgrounds.
- New instruction generation: additional instructions produced by GPT-4 and Claude-3.5-Sonnet acting as teacher models.
- Task identification: each instruction is classified so responses can be formatted appropriately.
- Multi-stage filtering: checks for linguistic quality and cultural sensitivity.
Refer to Figure 1 for the Bangla-Instruct generation pipeline.
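The pipeline in Figure 1 can be summarized with the sketch below. It is illustrative only: `query_teacher` stands in for calls to the teacher models (GPT-4 and Claude-3.5-Sonnet), and the prompt wording, sampling size, and acceptance checks are assumptions rather than the paper's exact generation code.

```python
# Simplified self-instruct-style loop (illustrative only).
import random

def query_teacher(prompt: str) -> str:
    """Placeholder for an API call to a teacher LLM (e.g. GPT-4, Claude-3.5-Sonnet)."""
    raise NotImplementedError("plug in your preferred LLM client here")

def is_acceptable(instruction: str, response: str, existing: list[str]) -> bool:
    """Stand-in for the multi-stage filter (language adherence, cultural
    sensitivity, content quality, novelty)."""
    too_similar = any(instruction.strip() == prev.strip() for prev in existing)
    return bool(instruction.strip()) and bool(response.strip()) and not too_similar

def generate_bangla_instruct(seed_tasks: list[str], target_size: int = 100_000):
    pool = list(seed_tasks)            # start from the 500 volunteer-written seeds
    dataset = []
    while len(dataset) < target_size:
        examples = random.sample(pool, k=min(4, len(pool)))
        prompt = (
            "Here are some Bangla instructions:\n"
            + "\n".join(f"- {ex}" for ex in examples)
            + "\nWrite one new, culturally appropriate Bangla instruction."
        )
        instruction = query_teacher(prompt)
        response = query_teacher(f"Answer the following instruction in Bangla:\n{instruction}")
        if is_acceptable(instruction, response, pool):
            dataset.append({"instruction": instruction, "response": response})
            pool.append(instruction)   # accepted instructions seed later iterations
    return dataset
```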
4. TigerLLM
TigerLLM is built by leveraging the strengths of both the Bangla-TextBook corpus and the Bangla-Instruct dataset. The training process involves:
- Continual Pretraining on the Bangla-TextBook corpus to capture language-specific nuances.
- Model Distillation via full fine-tuning (without LoRA) using Flash Attention, ensuring efficient convergence.
For details on the training pipeline, please see Figure 2 (overall pipeline), Figure 3 (pretraining loss), and Figure 4 (finetuning loss).
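The sketch below outlines how such a two-stage recipe could be expressed with Hugging Face `transformers`; the base-model identifiers, data paths, and sequence length are placeholders, and only the hyperparameters reported in Appendix B are taken from the paper.

```python
# Minimal sketch of the two-stage recipe: continual pretraining, then full finetuning.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "meta-llama/Llama-3.2-1B"   # or "google/gemma-2-9b" for the 9B variant

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)

# Stage 1: continual pretraining on the Bangla-TextBook corpus (path is a placeholder).
textbook = load_dataset("text", data_files={"train": "bangla_textbook/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = textbook["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tigerllm-pretrain",      # placeholder
        per_device_train_batch_size=64,
        gradient_accumulation_steps=16,
        num_train_epochs=4,
        learning_rate=5e-6,
        bf16=True,
        gradient_checkpointing=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Stage 2 (not shown): full-parameter finetuning of the resulting checkpoint on the
# 100K Bangla-Instruct pairs, again without LoRA; see Tables 4-5 for the settings.
```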
5. Evaluation
TigerLLM is evaluated on multiple Bangla-specific benchmarks including:
- MMLU-bn
- PangBench-bn
- BanglaQuaD
- mHumanEval-bn
- BEnQA
- BanglaRQA
The performance comparison is detailed in Table 2 below:
Model | MMLU-bn | PangBench-bn | BanglaQuaD | mHumanEval-bn | BEnQA | BanglaRQA |
---|---|---|---|---|---|---|
GPT3.5 | 0.55 | 0.55 | 0.50 | 0.56 | 0.50 | 0.49 |
Gemini-Flash1.5 | 0.66 | 0.57 | 0.62 | 0.58 | 0.56 | 0.61 |
GPT4o-mini | 0.67 | 0.62 | 0.65 | 0.56 | 0.60 | 0.60 |
LLaMA3.2 (11B) | 0.22 | 0.19 | 0.21 | 0.15 | 0.18 | 0.20 |
Gemma 2 (27B) | 0.35 | 0.51 | 0.43 | 0.64 | 0.50 | 0.56 |
Pangea (7B) | 0.18 | 0.15 | 0.17 | 0.10 | 0.14 | 0.16 |
Titu-LLM | 0.06 | 0.19 | 0.08 | 0.02 | 0.17 | 0.21 |
Bong-LLaMA | 0.05 | 0.12 | 0.08 | 0.02 | 0.15 | 0.13 |
Bangla-LLaMA | 0.02 | 0.08 | 0.05 | 0.10 | 0.11 | 0.09 |
Bangla-Gemma | 0.18 | 0.15 | 0.12 | 0.10 | 0.22 | 0.19 |
TigerLLM (1B) | 0.61 | 0.55 | 0.68 | 0.61 | 0.59 | 0.62 |
TigerLLM (9B) | 0.72 | 0.68 | 0.70 | 0.63 | 0.65 | 0.68 |
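For reference, a TigerLLM checkpoint can be loaded for inference like any other causal LM on the Hugging Face Hub. In the sketch below, `<tigerllm-repo-id>` is a placeholder; substitute the actual TigerLLM model identifier.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<tigerllm-repo-id>"  # placeholder: the 1B or 9B checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "বাংলাদেশের জাতীয় ফুল কী?"  # "What is the national flower of Bangladesh?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```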
6. Conclusion and Future Work
This paper presents TigerLLM, a family of Bangla language models that set new benchmarks by leveraging two high-quality datasets: the Bangla-TextBook corpus and the Bangla-Instruct dataset. Future work will involve qualitative analyses, expanding the corpus, scaling model sizes, and developing more sophisticated evaluation metrics.
Limitations
While TigerLLM demonstrates impressive performance, limitations remain. The Bangla-TextBook corpus is restricted to Grades 6–12 and may not capture broader linguistic nuances, and the Bangla-Instruct dataset covers a limited subset of instruction types. Additionally, the models are currently limited to 1B and 9B parameters due to computational constraints.
Ethical Considerations
Our approach emphasizes ethical practices by using open-source educational materials, ensuring cultural sensitivity via volunteer contributions, and applying rigorous filtering methods to avoid harmful biases. Users should implement further safeguards when deploying TigerLLM in sensitive applications.
References
- Alam, F., Chowdhury, S. A., et al. (2024). LLMs for low resource languages in multilingual settings.
- Bai, Y., Jones, A., et al. (2024). Claude 3.5 Sonnet Technical Report.
- Bhattacharjee, A., Hasan, T., et al. (2022). BanglaBERT: Language model pretraining and benchmarks for Bangla.
- OpenAI (2023). GPT-4 Technical Report.
- Brown, T., Mann, B., et al. (2020). Language models are few-shot learners.
- Chowdhery, A., Narang, S., et al. (2022). PaLM: Scaling language modeling with pathways.
- Corso, F., Pierri, F., et al. (2024). TikTokenizer research.
- Dubey, A., Jauhri, A., et al. (2024). The LLaMA 3 herd of models.
- Ekram, S. M. S., Rahman, A. A., et al. (2022). BanglaRQA benchmark.
- Gunasekar, S., et al. (2023). Textbooks are all you need.
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network.
- Hu, E. J., Wallis, P., et al. (2021). LoRA: Low-rank adaptation of large language models.
- Mitra, A., Del Corro, L., et al. (2023). Orca 2: Teaching small language models how to reason.
- Ortiz Suárez, P. J., Romary, L., & Sagot, B. (2020). Contextualized word embeddings for mid-resource languages.
- Raihan, N., Anastasopoulos, A., & Zampieri, M. (2024). mHumanEval β A multilingual benchmark for code generation.
- Rony, M. R. A. H., et al. (2024). BanglaQuaD: A Bangla open-domain question answering dataset.
- Shafayat, S., et al. (2024). BEnQA: A benchmark for Bangla question answering and reasoning.
- Taori, R., Gulrajani, I., et al. (2023). Alpaca: A replicable instruction-following model.
- Team, G., et al. (2024). Gemma 2: Improving open language models at a practical size.
- Wang, Y., et al. (2023). Self-instruct: Aligning language models with self-generated instructions.
- Wang, Y., et al. (2024). MMLU-Pro: A robust multi-task language understanding benchmark.
- Yue, X., et al. (2024). Pangea: A fully open multilingual multimodal LLM for 39 languages.
- Zehady, A. K., et al. (2024). BongLLama: Llama for Bangla language.
- Touvron, H., et al. (2023). LLaMA: Open and efficient foundation language models.
Appendix A: Bangla-Instruct Curation
A.1 Volunteer Information
Seed tasks were created by 50 volunteers from various Bangladeshi universities:
- 15 from Computer Science and Engineering
- 10 from Bengali Literature
- 10 from Business Administration
- 8 from Science and Engineering
- 7 from Social Sciences
A.2 The Seed Dataset
The seed dataset covers 10 categories:
- Cultural Knowledge and Heritage
- Academic Writing
- Mathematical Problem Solving
- Programming and Technical
- Creative Writing
- Scientific Explanation
- Business and Economics
- Social Issues Analysis
- Data Analysis and Statistics
- Language and Translation
A.3 Filtering Methodology
Filtering is based on the following criteria (an illustrative sketch follows this list):
- Language Adherence: High Bengali word ratio, Unicode consistency, and grammar score ≥ 0.8.
- Cultural Sensitivity: Ensuring religious neutrality, regional inclusivity, gender balance, and political neutrality.
- Content Quality: Minimum length, coherence between instruction and response, factual accuracy, and proper formatting.
- Novelty Verification: Ensuring low similarity with existing tasks and sufficient lexical diversity.
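A rough sketch of what some of these checks might look like in code is shown below. The thresholds (50% Bengali characters, 0.7 similarity cap, 20-character minimum) are assumptions for illustration, and the grammar, factuality, and cultural-sensitivity checks are not reproduced here.

```python
# Illustrative filter checks in the spirit of A.3; thresholds are assumptions.
import difflib

def bengali_ratio(text: str) -> float:
    """Fraction of non-space characters in the Bengali Unicode block (U+0980-U+09FF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0980" <= c <= "\u09ff" for c in chars) / len(chars)

def max_similarity(text: str, existing: list[str]) -> float:
    """Highest character-level similarity against already accepted instructions."""
    return max((difflib.SequenceMatcher(None, text, prev).ratio() for prev in existing),
               default=0.0)

def passes_filters(instruction: str, response: str, existing: list[str]) -> bool:
    if bengali_ratio(instruction) < 0.5 or bengali_ratio(response) < 0.5:
        return False                      # language adherence
    if len(response) < 20:
        return False                      # minimum content quality
    if max_similarity(instruction, existing) > 0.7:
        return False                      # novelty check
    return True                           # grammar/cultural checks omitted here
```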
Appendix B: Experimentation Details
B.1 Experimental Setup
Pretraining was conducted on a Lambda Labs cluster with 8 NVIDIA A100 GPUs (40GB each), 512GB RAM, and 2TB storage (~120 hours with gradient checkpointing). Finetuning was performed on a single NVIDIA A100 GPU via Google Colab (~96 hours).
B.2 Pretraining Hyperparameters (Table 3)
Hyperparameter | Value |
---|---|
Per device train batch size | 64 |
Gradient accumulation steps | 16 |
Number of training epochs | 4 |
Learning rate | 5e-6 |
FP16 | False |
BF16 | True |
Dataloader num workers | 8 |
Gradient checkpointing | True |
Logging steps | 1000 |
DDP find unused parameters | False |
Max gradient norm | 1.0 |
Warmup steps | 1000 |
Evaluation strategy | steps |
Evaluation steps | 1,000 |
Save strategy | steps |
Save steps | 1,000 |
Save total limit | 3 |
Load best model at end | True |
Metric for best model | loss |
Greater is better | False |
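Expressed as `transformers.TrainingArguments`, these settings would look roughly as follows; this is a sketch in which the output directory is a placeholder and unlisted arguments keep their library defaults.

```python
from transformers import TrainingArguments

pretraining_args = TrainingArguments(
    output_dir="tigerllm-pretrain",      # placeholder
    per_device_train_batch_size=64,
    gradient_accumulation_steps=16,
    num_train_epochs=4,
    learning_rate=5e-6,
    fp16=False,
    bf16=True,
    dataloader_num_workers=8,
    gradient_checkpointing=True,
    logging_steps=1000,
    ddp_find_unused_parameters=False,
    max_grad_norm=1.0,
    warmup_steps=1000,
    eval_strategy="steps",               # `evaluation_strategy` on older transformers releases
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
)
```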
B.3 Finetuning Hyperparameters
Finetuning settings for TigerLLM (1B) and (9B) are detailed in Tables 4 and 5.
Table 4: TigerLLM (1B) finetuning settings.
Parameter | TigerLLM (1B) |
---|---|
Max Sequence Length | 2048 |
Batch Size (Train/Eval) | 16 |
Gradient Accumulation Steps | 4 |
Number of Epochs | 3 |
Learning Rate | 1e-5 |
Weight Decay | 0.02 |
Warmup Steps | 10% |
Optimizer | AdamW (8-bit) |
LR Scheduler | Cosine |
Precision | BF16 |
Evaluation Steps | 50 |
Seed | 42 |
Table 5: TigerLLM (9B) finetuning settings.
Parameter | TigerLLM (9B) |
---|---|
Max Sequence Length | 2048 |
Batch Size (Train/Eval) | 32 |
Gradient Accumulation Steps | 8 |
Number of Epochs | 3 |
Learning Rate | 1e-6 |
Weight Decay | 0.04 |
Warmup Steps | 15% |
Optimizer | AdamW (8-bit) |
LR Scheduler | Cosine |
Precision | BF16 |
Evaluation Steps | 250 |
Seed | 42 |
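As with pretraining, the Table 4 settings map naturally onto `TrainingArguments`. The sketch below covers the 1B run; the 9B run (Table 5) differs only in batch size, gradient accumulation, learning rate, weight decay, warmup, and evaluation interval. The output path is a placeholder.

```python
from transformers import TrainingArguments

finetune_args_1b = TrainingArguments(
    output_dir="tigerllm-1b-sft",        # placeholder
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=1e-5,
    weight_decay=0.02,
    warmup_ratio=0.10,                   # "Warmup Steps: 10%" expressed as a ratio
    optim="adamw_bnb_8bit",              # 8-bit AdamW (requires bitsandbytes)
    lr_scheduler_type="cosine",
    bf16=True,
    eval_strategy="steps",
    eval_steps=50,
    seed=42,
)
# Sequences are truncated/packed to the 2048-token maximum at the dataset level.
```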
Appendix C: TigerLLM - Training Pipeline
Figure 2 illustrates the multi-stage training pipeline for producing both TigerLLM (1B) and TigerLLM (9B). The process begins with pre-trained models (LLaMA 3.2 and Gemma-2), followed by continual pretraining on the Bangla-TextBook corpus and subsequent finetuning on the Bangla-Instruct dataset. Figures 3 and 4 depict the loss curves during the pretraining and finetuning stages respectively.