Accepted at ACL 2025 (Main Conference)

TigerLLM - A Family of Bangla Large Language Models

Nishat Raihan, Marcos Zampieri

George Mason University, VA, USA

[email protected]


If you find our work helpful, please consider citing our paper:

@inproceedings{raihan-zampieri-2025-tigerllm,
    title = "{T}iger{LLM} - A Family of {B}angla Large Language Models",
    author = "Raihan, Nishat  and
      Zampieri, Marcos",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-short.69/",
    doi = "10.18653/v1/2025.acl-short.69",
    pages = "887--896",
    ISBN = "979-8-89176-252-7"
}

Abstract

The development of Large Language Models (LLMs) remains heavily skewed towards English and a few other high-resource languages. This linguistic disparity is particularly evident for Bangla, the 5th most spoken language in the world. A few initiatives have attempted to create open-source Bangla LLMs, but their performance still lags behind high-resource languages and their results are difficult to reproduce. To address this gap, we introduce TigerLLM, a family of Bangla LLMs. Our results demonstrate that these models surpass all open-source alternatives and also outperform larger proprietary models such as GPT-3.5 across standard benchmarks, establishing TigerLLM as the new baseline for future Bangla language modeling.


1. Introduction

LLMs have fundamentally transformed NLP by achieving exceptional performance across a wide range of tasks. However, their advancements have predominantly benefited high-resource languages. Despite having about 237 million native speakers, Bangla remains underserved in modern NLP, largely due to the lack of high-quality training data and reproducible training methodologies.

1.1 Limitations of Bangla LLM Initiatives

Recent efforts (e.g., titu-Gemma, titu-LLaMA, Bangla-LLaMA, G2B) suffer from low reproducibility, suboptimal performance, and poor documentation. Many rely on translated synthetic datasets, leading to compromised instruction quality.

| Base-LLM | Size | Pretraining (pt) | Corpora | Finetuning (ft) | Finetune Dataset | Paper/Report? | Reproducibility? |
|---|---|---|---|---|---|---|---|
| titu-Gemma (Gemma-2) | 2B | 4.4B | ✕ | ✕ | ✕ | ✕ | ✕ |
| titu-LLaMA (LLaMA-3.1) | 3B | 37B | ✕ | ✕ | ✕ | ✕ | ✕ |
| Bangla-LLaMA (LLaMA-3.2) | 3B | ✓ | ✕ | 172K (Orca-translated) | ✓ | ✕ | ✕ |
| G2B (Gemma-2) | 9B | ✕ | ✕ | 145K (Alpaca-translated) | ✕ | ✕ | ✕ |
| Bangla-LLaMA (LLaMA-2) | 13B | ✓ | ✕ | 145K (Alpaca-translated) | ✕ | ✕ | ✕ |
| TigerLLM (LLaMA-3.2) | 1B | 10M | Bangla-TextBook | 100K | Bangla-Instruct | ✓ | ✓ |
| TigerLLM (Gemma-2) | 9B | 10M | Bangla-TextBook | 100K | Bangla-Instruct | ✓ | ✓ |

1.2 Contributions

  • Bangla-TextBook Corpus: A 10M-token corpus of high-quality educational texts.
  • Bangla-Instruct Dataset: 100K native Bangla instruction-response pairs generated via self-instruct and advanced teacher models.
  • TigerLLM Models: A family of models (1B and 9B parameters) that achieve significant performance improvements over existing alternatives.

2. Bangla-TextBook Corpus

The Bangla-TextBook corpus is compiled exclusively from open-source educational materials provided by the National Curriculum and Textbook Board of Bangladesh. It aggregates texts from 163 textbooks for Grades 6–12, yielding 9,897,623 tokens and 697,903 sentences, capturing authentic academic language use.
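As a rough illustration of how such corpus-level statistics can be gathered, the following Python sketch tallies tokens and sentences over a directory of plain-text files. The `textbooks/` layout, the whitespace tokenisation, and the danda-based sentence splitting are assumptions, not the exact preprocessing used to build the corpus.

```python
from pathlib import Path
import re

# Assumed layout: one UTF-8 plain-text file per textbook under textbooks/.
CORPUS_DIR = Path("textbooks")

# Bangla sentences usually end with the danda ("।"); "?" and "!" also occur.
SENTENCE_END = re.compile(r"[।?!]")

total_tokens = 0
total_sentences = 0
for path in CORPUS_DIR.glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    total_tokens += len(text.split())              # whitespace tokens (approximation)
    total_sentences += len(SENTENCE_END.findall(text))

print(f"{total_tokens:,} tokens, {total_sentences:,} sentences")
```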


3. Bangla-Instruct

To overcome previous limitations, the Bangla-Instruct dataset contains 100,000 instruction-response pairs generated using a self-instruct framework. Key steps include:

  1. Seed Task Generation: 500 tasks curated by 50 volunteers from diverse academic backgrounds.
  2. Instruction Generation: new instructions produced by GPT-4 and Claude-3.5-Sonnet as teacher models.
  3. Task Identification: each instruction is categorized so responses follow an appropriate format.
  4. Multi-stage Filtering: generated pairs are screened for linguistic quality and cultural sensitivity.

Refer to Figure 1 for the Bangla-Instruct generation pipeline; a minimal sketch of the generation loop is given below.
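The sketch covers steps 2–4 under stated assumptions: the teacher-model calls and the filter are placeholders (the real pipeline queries GPT-4 and Claude-3.5-Sonnet through their APIs), and the number of in-context examples per prompt is assumed rather than taken from the paper.

```python
import random

def teacher_generate(prompt: str) -> str:
    """Placeholder for a call to a teacher model (GPT-4 or Claude-3.5-Sonnet)."""
    raise NotImplementedError

def passes_filters(instruction: str, response: str, accepted: list[str]) -> bool:
    """Placeholder for the multi-stage filter (language, cultural, quality, novelty checks)."""
    raise NotImplementedError

def build_bangla_instruct(seed_tasks: list[str], target_size: int = 100_000):
    pool = list(seed_tasks)                      # 500 volunteer-written seed instructions
    dataset: list[tuple[str, str]] = []
    while len(dataset) < target_size:
        # Step 2: prompt a teacher model with sampled in-context examples
        # to produce a new Bangla instruction (8 examples per prompt is an assumption).
        examples = random.sample(pool, k=min(8, len(pool)))
        instruction = teacher_generate(
            "এইরকম আরেকটি নতুন নির্দেশনা লিখুন:\n" + "\n".join(examples)
        )
        # Step 3: generate a response formatted for the identified task type.
        response = teacher_generate(instruction)
        # Step 4: keep the pair only if it clears every filtering stage.
        if passes_filters(instruction, response, [i for i, _ in dataset]):
            dataset.append((instruction, response))
            pool.append(instruction)             # accepted instructions seed later rounds
    return dataset
```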


4. TigerLLM

TigerLLM is built by leveraging the strengths of both the Bangla-TextBook corpus and the Bangla-Instruct dataset. The training process involves:

  • Continual Pretraining on the Bangla-TextBook corpus to capture language-specific nuances.
  • Model Distillation via full fine-tuning (without LoRA) using Flash Attention, ensuring efficient convergence.

For details on the training pipeline, please see Figure 2 (overall pipeline), Figure 3 (pretraining loss), and Figure 4 (finetuning loss).
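The paper does not ship a training script in this summary; a minimal sketch of the two-stage recipe using Hugging Face transformers is shown below. The base checkpoint name, output directories, and the assumption that both datasets arrive pre-tokenised are ours; the per-stage hyperparameters reported in Appendix B are omitted here for brevity.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.2-1B"   # assumed base checkpoint for TigerLLM (1B)

def train_tigerllm(textbook_dataset, instruct_dataset):
    """Two-stage recipe: continual pretraining, then full finetuning (no LoRA).

    Both arguments are assumed to be pre-tokenised `datasets.Dataset` objects
    built from Bangla-TextBook and Bangla-Instruct respectively.
    """
    tokenizer = AutoTokenizer.from_pretrained(BASE)
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        BASE,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",   # Flash Attention, as described above
    )
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    # Stage 1: continual pretraining on the Bangla-TextBook corpus.
    Trainer(
        model=model,
        args=TrainingArguments(output_dir="tigerllm-1b-pt",
                               bf16=True, gradient_checkpointing=True),
        train_dataset=textbook_dataset,
        data_collator=collator,
    ).train()

    # Stage 2: full-parameter finetuning on Bangla-Instruct (all weights updated, no LoRA).
    Trainer(
        model=model,
        args=TrainingArguments(output_dir="tigerllm-1b-it", bf16=True),
        train_dataset=instruct_dataset,
        data_collator=collator,
    ).train()
    return model
```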


5. Evaluation

TigerLLM is evaluated on multiple Bangla-specific benchmarks including:

  • MMLU-bn
  • PangBench-bn
  • BanglaQuaD
  • mHumanEval-bn
  • BEnQA
  • BanglaRQA

The performance comparison is detailed in Table 2 below:

| Model | MMLU-bn | PangBench-bn | BanglaQuaD | mHumanEval-bn | BEnQA | BanglaRQA |
|---|---|---|---|---|---|---|
| GPT-3.5 | 0.55 | 0.55 | 0.50 | 0.56 | 0.50 | 0.49 |
| Gemini-Flash 1.5 | 0.66 | 0.57 | 0.62 | 0.58 | 0.56 | 0.61 |
| GPT-4o-mini | 0.67 | 0.62 | 0.65 | 0.56 | 0.60 | 0.60 |
| LLaMA 3.2 (11B) | 0.22 | 0.19 | 0.21 | 0.15 | 0.18 | 0.20 |
| Gemma 2 (27B) | 0.35 | 0.51 | 0.43 | 0.64 | 0.50 | 0.56 |
| Pangea (7B) | 0.18 | 0.15 | 0.17 | 0.10 | 0.14 | 0.16 |
| Titu-LLM | 0.06 | 0.19 | 0.08 | 0.02 | 0.17 | 0.21 |
| Bong-LLaMA | 0.05 | 0.12 | 0.08 | 0.02 | 0.15 | 0.13 |
| Bangla-LLaMA | 0.02 | 0.08 | 0.05 | 0.10 | 0.11 | 0.09 |
| Bangla-Gemma | 0.18 | 0.15 | 0.12 | 0.10 | 0.22 | 0.19 |
| TigerLLM (1B) | 0.61 | 0.55 | 0.68 | 0.61 | 0.59 | 0.62 |
| TigerLLM (9B) | 0.72 | 0.68 | 0.70 | 0.63 | 0.65 | 0.68 |
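The exact evaluation harness is not reproduced in this summary. As a hedged sketch, the helper below scores a causal LM on a multiple-choice Bangla benchmark via greedy decoding and string matching; the example fields ("question", "options", "answer") and the prompt template are assumptions about the benchmark format, not the paper's protocol.

```python
import torch

def benchmark_accuracy(model, tokenizer, examples):
    """Zero-shot accuracy over an iterable of dicts with assumed keys
    "question", "options" (list of strings) and "answer"."""
    correct, total = 0, 0
    for ex in examples:
        prompt = ex["question"] + "\n" + "\n".join(ex["options"]) + "\nউত্তর:"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        prediction = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        correct += int(ex["answer"].strip() in prediction)
        total += 1
    return correct / max(total, 1)
```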

6. Conclusion and Future Work

This paper presents TigerLLM, a family of Bangla language models that set new benchmarks by leveraging two high-quality datasets: the Bangla-TextBook corpus and the Bangla-Instruct dataset. Future work will involve qualitative analyses, expanding the corpus, scaling model sizes, and developing more sophisticated evaluation metrics.


Limitations

While TigerLLM demonstrates impressive performance, limitations remain. The Bangla-TextBook corpus is restricted to Grades 6–12 and may not capture broader linguistic nuances, and the Bangla-Instruct dataset covers a limited subset of instruction types. Additionally, the models are currently limited to 1B and 9B parameters due to computational constraints.


Ethical Considerations

Our approach emphasizes ethical practices by using open-source educational materials, ensuring cultural sensitivity via volunteer contributions, and applying rigorous filtering methods to avoid harmful biases. Users should implement further safeguards when deploying TigerLLM in sensitive applications.


References

  • Alam, F., Chowdhury, S. A., et al. (2024). LLMs for low resource languages in multilingual settings.
  • Bai, Y., Jones, A., et al. (2024). Claude 3.5 Sonnet Technical Report.
  • Bhattacharjee, A., Hasan, T., et al. (2022). BanglaBERT: Language model pretraining and benchmarks for Bangla.
  • OpenAI (2023). GPT-4 Technical Report.
  • Brown, T., Mann, B., et al. (2020). Language models are few-shot learners.
  • Chowdhery, A., Narang, S., et al. (2022). PaLM: Scaling language modeling with pathways.
  • Corso, F., Pierri, F., et al. (2024). TikTokenizer research.
  • Dubey, A., Jauhri, A., et al. (2024). The LLaMA 3 herd of models.
  • Ekram, S. M. S., Rahman, A. A., et al. (2022). BanglaRQA benchmark.
  • Gunasekar, S., et al. (2023). Textbooks are all you need.
  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network.
  • Hu, E. J., Wallis, P., et al. (2021). LoRA: Low-rank adaptation of large language models.
  • Mitra, A., Del Corro, L., et al. (2023). Orca 2: Teaching small language models how to reason.
  • Ortiz Suárez, P. J., Romary, L., & Sagot, B. (2020). A monolingual approach to contextualized word embeddings for mid-resource languages.
  • Raihan, N., Anastasopoulos, A., & Zampieri, M. (2024). mHumanEval – A multilingual benchmark for code generation.
  • Rony, M. R. A. H., et al. (2024). BanglaQuaD: A Bangla open-domain question answering dataset.
  • Shafayat, S., et al. (2024). BEnQA: A benchmark for Bangla question answering and reasoning.
  • Taori, R., Gulrajani, I., et al. (2023). Alpaca: A replicable instruction-following model.
  • Gemma Team, et al. (2024). Gemma 2: Improving open language models at a practical size.
  • Wang, Y., et al. (2023). Self-instruct: Aligning language models with self-generated instructions.
  • Wang, Y., et al. (2024). MMLU-Pro: A robust multi-task language understanding benchmark.
  • Yue, X., et al. (2024). Pangea: A fully open multilingual multimodal LLM for 39 languages.
  • Zehady, A. K., et al. (2024). BongLLama: Llama for Bangla language.
  • Touvron, H., Lavril, T., et al. (2023). LLaMA: Open and efficient foundation language models.

Appendix A: Bangla-Instruct Curation

A.1 Volunteer Information

Seed tasks were created by 50 volunteers from various Bangladeshi universities:

  • 15 from Computer Science and Engineering
  • 10 from Bengali Literature
  • 10 from Business Administration
  • 8 from Science and Engineering
  • 7 from Social Sciences
Each volunteer contributed 10 diverse instructions, resulting in 500 seed tasks.

A.2 The Seed Dataset

The seed dataset covers 10 categories:

  1. Cultural Knowledge and Heritage
  2. Academic Writing
  3. Mathematical Problem Solving
  4. Programming and Technical
  5. Creative Writing
  6. Scientific Explanation
  7. Business and Economics
  8. Social Issues Analysis
  9. Data Analysis and Statistics
  10. Language and Translation
Each category is represented with approximately 50 tasks.

A.3 Filtering Methodology

Filtering is based on:

  • Language Adherence: High Bengali word ratio, Unicode consistency, and grammar score β‰₯ 0.8.
  • Cultural Sensitivity: Ensuring religious neutrality, regional inclusivity, gender balance, and political neutrality.
  • Content Quality: Minimum length, coherence between instruction and response, factual accuracy, and proper formatting.
  • Novelty Verification: Ensuring low similarity with existing tasks and sufficient lexical diversity.
A pair (i, r) is accepted only if all of these criteria are met, as in the simplified check sketched below.
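For illustration only, here is a simplified accept/reject check in Python; the thresholds are assumptions, and the grammar, factuality, and cultural-sensitivity scoring used in the actual pipeline (which relies on model- or annotator-based judgments) is deliberately left out.

```python
import difflib
import re

BENGALI_CHAR = re.compile(r"[\u0980-\u09FF]")   # Bengali Unicode block

def bengali_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Bengali."""
    chars = [c for c in text if not c.isspace()]
    return sum(bool(BENGALI_CHAR.match(c)) for c in chars) / max(len(chars), 1)

def max_similarity(text: str, accepted: list[str]) -> float:
    """Highest character-level similarity to any already-accepted instruction."""
    return max((difflib.SequenceMatcher(None, text, prev).ratio() for prev in accepted),
               default=0.0)

def accept_pair(instruction: str, response: str, accepted: list[str]) -> bool:
    # Grammar scoring, factual accuracy, and cultural-sensitivity checks are
    # model- or annotator-based in the pipeline and omitted from this sketch.
    return (
        bengali_ratio(instruction) >= 0.8                 # language adherence (threshold assumed)
        and bengali_ratio(response) >= 0.8
        and len(response.split()) >= 20                   # minimum length (value assumed)
        and max_similarity(instruction, accepted) < 0.7   # novelty (threshold assumed)
    )
```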


Appendix B: Experimentation Details

B.1 Experimental Setup

Pretraining was conducted on a Lambda Labs cluster with 8 NVIDIA A100 GPUs (40GB each), 512GB RAM, and 2TB storage (~120 hours with gradient checkpointing). Finetuning was performed on a single NVIDIA A100 GPU via Google Colab (~96 hours).

B.2 Pretraining Hyperparameters (Table 3)

| Hyperparameter | Value |
|---|---|
| Per-device train batch size | 64 |
| Gradient accumulation steps | 16 |
| Number of training epochs | 4 |
| Learning rate | 5×10⁻⁶ |
| FP16 | False |
| BF16 | True |
| Dataloader num workers | 8 |
| Gradient checkpointing | True |
| Logging steps | 1000 |
| DDP find unused parameters | False |
| Max gradient norm | 1.0 |
| Warmup steps | 1000 |
| Evaluation strategy | steps |
| Evaluation steps | 1000 |
| Save strategy | steps |
| Save steps | 1000 |
| Save total limit | 3 |
| Load best model at end | True |
| Metric for best model | loss |
| Greater is better | False |
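These settings map almost one-to-one onto Hugging Face `TrainingArguments`; the sketch below shows that mapping, with only the output directory name being an assumption.

```python
from transformers import TrainingArguments

pretrain_args = TrainingArguments(
    output_dir="tigerllm-pt",              # assumed
    per_device_train_batch_size=64,
    gradient_accumulation_steps=16,
    num_train_epochs=4,
    learning_rate=5e-6,
    fp16=False,
    bf16=True,
    dataloader_num_workers=8,
    gradient_checkpointing=True,
    logging_steps=1000,
    ddp_find_unused_parameters=False,
    max_grad_norm=1.0,
    warmup_steps=1000,
    eval_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
)
```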

B.3 Finetuning Hyperparameters

Finetuning settings for TigerLLM (1B) and (9B) are detailed in Tables 4 and 5.

| Parameter | TigerLLM (1B) |
|---|---|
| Max sequence length | 2048 |
| Batch size (train/eval) | 16 |
| Gradient accumulation steps | 4 |
| Number of epochs | 3 |
| Learning rate | 1e-5 |
| Weight decay | 0.02 |
| Warmup steps | 10% |
| Optimizer | AdamW (8-bit) |
| LR scheduler | Cosine |
| Precision | BF16 |
| Evaluation steps | 50 |
| Seed | 42 |

| Parameter | TigerLLM (9B) |
|---|---|
| Max sequence length | 2048 |
| Batch size (train/eval) | 32 |
| Gradient accumulation steps | 8 |
| Number of epochs | 3 |
| Learning rate | 1e-6 |
| Weight decay | 0.04 |
| Warmup steps | 15% |
| Optimizer | AdamW (8-bit) |
| LR scheduler | Cosine |
| Precision | BF16 |
| Evaluation steps | 250 |
| Seed | 42 |
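As with pretraining, the Table 4 settings for TigerLLM (1B) translate into `TrainingArguments` roughly as below. The output directory is an assumption, the 8-bit AdamW optimizer is selected through the bitsandbytes backend (`adamw_bnb_8bit`), and the 2048-token limit is applied at tokenisation time rather than as a `TrainingArguments` field.

```python
from transformers import TrainingArguments

finetune_args_1b = TrainingArguments(
    output_dir="tigerllm-1b-it",           # assumed
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=1e-5,
    weight_decay=0.02,
    warmup_ratio=0.10,                     # "10% warmup steps"
    optim="adamw_bnb_8bit",                # 8-bit AdamW
    lr_scheduler_type="cosine",
    bf16=True,
    eval_strategy="steps",
    eval_steps=50,
    seed=42,
)
# The 2048-token maximum sequence length is enforced when tokenising
# Bangla-Instruct examples, e.g. tokenizer(..., truncation=True, max_length=2048).
```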

Appendix C: TigerLLM - Training Pipeline

Figure 2 illustrates the multi-stage training pipeline for producing both TigerLLM (1B) and TigerLLM (9B). The process begins with pre-trained models (LLaMA 3.2 and Gemma-2), followed by continual pretraining on the Bangla-TextBook corpus and subsequent finetuning on the Bangla-Instruct dataset. Figures 3 and 4 depict the loss curves during the pretraining and finetuning stages respectively.
