Accepted at ACL 2025 (Main Conference)

TigerLLM - A Family of Bangla Large Language Models

Nishat Raihan, Marcos Zampieri

George Mason University, VA, USA

[email protected]


If you find our work helpful, please consider citing our paper:

@inproceedings{raihan-zampieri-2025-tigerllm,
    title = "{T}iger{LLM} - A Family of {B}angla Large Language Models",
    author = "Raihan, Nishat  and
      Zampieri, Marcos",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-short.69/",
    doi = "10.18653/v1/2025.acl-short.69",
    pages = "887--896",
    ISBN = "979-8-89176-252-7"
}

Abstract

The development of Large Language Models (LLMs) remains heavily skewed towards English and a few other high-resource languages. This linguistic disparity is particularly evident for Bangla, the 5th most spoken language in the world. A few initiatives have attempted to create open-source Bangla LLMs, but their performance still lags behind high-resource languages and their results are difficult to reproduce. To address this gap, we introduce TigerLLM, a family of Bangla LLMs. Our results demonstrate that these models surpass all open-source alternatives and also outperform larger proprietary models such as GPT-3.5 across standard benchmarks, establishing TigerLLM as the new baseline for future Bangla language modeling.


1. Introduction

LLMs have fundamentally transformed NLP by achieving exceptional performance across a wide range of tasks. However, their advancements have predominantly benefited high-resource languages. Despite having about 237 million native speakers, Bangla remains underserved in modern NLP, largely due to the lack of high-quality training data and reproducible training methodologies.

1.1 Limitations of Bangla LLM Initiatives

Recent efforts (e.g., titu-Gemma, titu-LLaMA, Bangla-LLaMA, G2B) suffer from low reproducibility, suboptimal performance, and poor documentation. Many rely on translated synthetic datasets, leading to compromised instruction quality.

| Base-LLM | Size | Pretraining (pt) | Corpora | Finetuning (ft) | Finetune Dataset | Paper/Report? | Reproducibility? |
|---|---|---|---|---|---|---|---|
| titu-Gemma (Gemma-2) | 2B | 4.4B | ✕ | ✕ | ✕ | ✕ | ✕ |
| titu-LLaMA (LLaMA-3.1) | 3B | 37B | ✕ | ✕ | ✕ | ✕ | ✕ |
| Bangla-LLaMA (LLaMA-3.2) | 3B | ✓ | ✕ | 172K (Orca-translated) | ✓ | ✕ | ✕ |
| G2B (Gemma-2) | 9B | ✕ | ✕ | 145K (Alpaca-translated) | ✕ | ✕ | ✕ |
| Bangla-LLaMA (LLaMA-2) | 13B | ✓ | ✕ | 145K (Alpaca-translated) | ✕ | ✕ | ✕ |
| TigerLLM (LLaMA-3.2) | 1B | 10M | Bangla-TextBook | 100K | Bangla-Instruct | ✓ | ✓ |
| TigerLLM (Gemma-2) | 9B | 10M | Bangla-TextBook | 100K | Bangla-Instruct | ✓ | ✓ |

1.2 Contributions

  • Bangla-TextBook Corpus: A 10M-token corpus of high-quality educational texts.
  • Bangla-Instruct Dataset: 100K native Bangla instruction-response pairs generated via self-instruct and advanced teacher models.
  • TigerLLM Models: A family of models (1B and 9B parameters) that achieve significant performance improvements over existing alternatives.

2. Bangla-TextBook Corpus

The Bangla-TextBook corpus is compiled exclusively from open-source educational materials provided by the National Curriculum and Textbook Board of Bangladesh. It aggregates texts from 163 textbooks for Grades 6–12, yielding 9,897,623 tokens and 697,903 sentences, capturing authentic academic language use.
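As a rough illustration of how such corpus-level statistics can be gathered, the following Python sketch tallies tokens and sentences over a directory of plain-text files. The `textbooks/` layout, the whitespace tokenisation, and the danda-based sentence splitting are assumptions, not the exact preprocessing used to build the corpus.

```python
from pathlib import Path
import re

# Assumed layout: one UTF-8 plain-text file per textbook under textbooks/.
CORPUS_DIR = Path("textbooks")

# Bangla sentences usually end with the danda ("।"); "?" and "!" also occur.
SENTENCE_END = re.compile(r"[।?!]")

total_tokens = 0
total_sentences = 0
for path in CORPUS_DIR.glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    total_tokens += len(text.split())              # whitespace tokens (approximation)
    total_sentences += len(SENTENCE_END.findall(text))

print(f"{total_tokens:,} tokens, {total_sentences:,} sentences")
```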


3. Bangla-Instruct

To overcome previous limitations, the Bangla-Instruct dataset contains 100,000 instruction-response pairs generated using a self-instruct framework. Key steps include:

  1. Seed Task Generation: 500 tasks curated by 50 volunteers from diverse academic backgrounds.
  2. Instruction Generation: new instructions produced by GPT-4 and Claude-3.5-Sonnet as teacher models.
  3. Task Identification: each instruction is categorized so responses follow an appropriate format.
  4. Multi-stage Filtering: generated pairs are screened for linguistic quality and cultural sensitivity.

Refer to Figure 1 for the Bangla-Instruct generation pipeline; a minimal sketch of the generation loop is given below.
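The sketch covers steps 2–4 under stated assumptions: the teacher-model calls and the filter are placeholders (the real pipeline queries GPT-4 and Claude-3.5-Sonnet through their APIs), and the number of in-context examples per prompt is assumed rather than taken from the paper.

```python
import random

def teacher_generate(prompt: str) -> str:
    """Placeholder for a call to a teacher model (GPT-4 or Claude-3.5-Sonnet)."""
    raise NotImplementedError

def passes_filters(instruction: str, response: str, accepted: list[str]) -> bool:
    """Placeholder for the multi-stage filter (language, cultural, quality, novelty checks)."""
    raise NotImplementedError

def build_bangla_instruct(seed_tasks: list[str], target_size: int = 100_000):
    pool = list(seed_tasks)                      # 500 volunteer-written seed instructions
    dataset: list[tuple[str, str]] = []
    while len(dataset) < target_size:
        # Step 2: prompt a teacher model with sampled in-context examples
        # to produce a new Bangla instruction (8 examples per prompt is an assumption).
        examples = random.sample(pool, k=min(8, len(pool)))
        instruction = teacher_generate(
            "এইরকম আরেকটি নতুন নির্দেশনা লিখুন:\n" + "\n".join(examples)
        )
        # Step 3: generate a response formatted for the identified task type.
        response = teacher_generate(instruction)
        # Step 4: keep the pair only if it clears every filtering stage.
        if passes_filters(instruction, response, [i for i, _ in dataset]):
            dataset.append((instruction, response))
            pool.append(instruction)             # accepted instructions seed later rounds
    return dataset
```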


4. TigerLLM

TigerLLM is built by leveraging the strengths of both the Bangla-TextBook corpus and the Bangla-Instruct dataset. The training process involves:

  • Continual Pretraining on the Bangla-TextBook corpus to capture language-specific nuances.
  • Model Distillation via full fine-tuning (without LoRA) using Flash Attention, ensuring efficient convergence.

For details on the training pipeline, please see Figure 2 (overall pipeline), Figure 3 (pretraining loss), and Figure 4 (finetuning loss).
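The paper does not ship a training script in this summary; a minimal sketch of the two-stage recipe using Hugging Face transformers is shown below. The base checkpoint name, output directories, and the assumption that both datasets arrive pre-tokenised are ours; the per-stage hyperparameters reported in Appendix B are omitted here for brevity.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.2-1B"   # assumed base checkpoint for TigerLLM (1B)

def train_tigerllm(textbook_dataset, instruct_dataset):
    """Two-stage recipe: continual pretraining, then full finetuning (no LoRA).

    Both arguments are assumed to be pre-tokenised `datasets.Dataset` objects
    built from Bangla-TextBook and Bangla-Instruct respectively.
    """
    tokenizer = AutoTokenizer.from_pretrained(BASE)
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        BASE,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",   # Flash Attention, as described above
    )
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    # Stage 1: continual pretraining on the Bangla-TextBook corpus.
    Trainer(
        model=model,
        args=TrainingArguments(output_dir="tigerllm-1b-pt",
                               bf16=True, gradient_checkpointing=True),
        train_dataset=textbook_dataset,
        data_collator=collator,
    ).train()

    # Stage 2: full-parameter finetuning on Bangla-Instruct (all weights updated, no LoRA).
    Trainer(
        model=model,
        args=TrainingArguments(output_dir="tigerllm-1b-it", bf16=True),
        train_dataset=instruct_dataset,
        data_collator=collator,
    ).train()
    return model
```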


5. Evaluation

TigerLLM is evaluated on multiple Bangla-specific benchmarks including:

  • MMLU-bn
  • PangBench-bn
  • BanglaQuaD
  • mHumanEval-bn
  • BEnQA
  • BanglaRQA

The performance comparison is detailed in Table 2 below:

| Model | MMLU-bn | PangBench-bn | BanglaQuaD | mHumanEval-bn | BEnQA | BanglaRQA |
|---|---|---|---|---|---|---|
| GPT-3.5 | 0.55 | 0.55 | 0.50 | 0.56 | 0.50 | 0.49 |
| Gemini-Flash 1.5 | 0.66 | 0.57 | 0.62 | 0.58 | 0.56 | 0.61 |
| GPT-4o-mini | 0.67 | 0.62 | 0.65 | 0.56 | 0.60 | 0.60 |
| LLaMA 3.2 (11B) | 0.22 | 0.19 | 0.21 | 0.15 | 0.18 | 0.20 |
| Gemma 2 (27B) | 0.35 | 0.51 | 0.43 | 0.64 | 0.50 | 0.56 |
| Pangea (7B) | 0.18 | 0.15 | 0.17 | 0.10 | 0.14 | 0.16 |
| Titu-LLM | 0.06 | 0.19 | 0.08 | 0.02 | 0.17 | 0.21 |
| Bong-LLaMA | 0.05 | 0.12 | 0.08 | 0.02 | 0.15 | 0.13 |
| Bangla-LLaMA | 0.02 | 0.08 | 0.05 | 0.10 | 0.11 | 0.09 |
| Bangla-Gemma | 0.18 | 0.15 | 0.12 | 0.10 | 0.22 | 0.19 |
| TigerLLM (1B) | 0.61 | 0.55 | 0.68 | 0.61 | 0.59 | 0.62 |
| TigerLLM (9B) | 0.72 | 0.68 | 0.70 | 0.63 | 0.65 | 0.68 |
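The exact evaluation harness is not reproduced in this summary. As a hedged sketch, the helper below scores a causal LM on a multiple-choice Bangla benchmark via greedy decoding and string matching; the example fields ("question", "options", "answer") and the prompt template are assumptions about the benchmark format, not the paper's protocol.

```python
import torch

def benchmark_accuracy(model, tokenizer, examples):
    """Zero-shot accuracy over an iterable of dicts with assumed keys
    "question", "options" (list of strings) and "answer"."""
    correct, total = 0, 0
    for ex in examples:
        prompt = ex["question"] + "\n" + "\n".join(ex["options"]) + "\nউত্তর:"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        prediction = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        correct += int(ex["answer"].strip() in prediction)
        total += 1
    return correct / max(total, 1)
```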

6. Conclusion and Future Work

This paper presents TigerLLM, a family of Bangla language models that set new benchmarks by leveraging two high-quality datasets: the Bangla-TextBook corpus and the Bangla-Instruct dataset. Future work will involve qualitative analyses, expanding the corpus, scaling model sizes, and developing more sophisticated evaluation metrics.


Limitations

While TigerLLM demonstrates impressive performance, limitations remain. The Bangla-TextBook corpus is restricted to Grades 6–12 and may not capture broader linguistic nuances, and the Bangla-Instruct dataset covers a limited subset of instruction types. Additionally, the models are currently limited to 1B and 9B parameters due to computational constraints.


Ethical Considerations

Our approach emphasizes ethical practices by using open-source educational materials, ensuring cultural sensitivity via volunteer contributions, and applying rigorous filtering methods to avoid harmful biases. Users should implement further safeguards when deploying TigerLLM in sensitive applications.


References

  • Alam, F., Chowdhury, S. A., et al. (2024). LLMs for low resource languages in multilingual settings.
  • Bai, Y., Jones, A., et al. (2024). Claude 3.5 Sonnet Technical Report.
  • Bhattacharjee, A., Hasan, T., et al. (2022). BanglaBERT: Language model pretraining and benchmarks for Bangla.
  • OpenAI (2023). GPT-4 Technical Report.
  • Brown, T., Mann, B., et al. (2020). Language models are few-shot learners.
  • Chowdhery, A., Narang, S., et al. (2022). PaLM: Scaling language modeling with pathways.
  • Corso, F., Pierri, F., et al. (2024). TikTokenizer research.
  • Dubey, A., Jauhri, A., et al. (2024). The LLaMA 3 herd of models.
  • Ekram, S. M. S., Rahman, A. A., et al. (2022). BanglaRQA benchmark.
  • Gunasekar, S., et al. (2023). Textbooks are all you need.
  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network.
  • Hu, E. J., Wallis, P., et al. (2021). LoRA: Low-rank adaptation of large language models.
  • Mitra, A., Del Corro, L., et al. (2023). Orca 2: Teaching small language models how to reason.
  • Ortiz Suárez, P. J., Romary, L., & Sagot, B. (2020). A monolingual approach to contextualized word embeddings for mid-resource languages.
  • Raihan, N., Anastasopoulos, A., & Zampieri, M. (2024). mHumanEval – A multilingual benchmark for code generation.
  • Rony, M. R. A. H., et al. (2024). BanglaQuaD: A Bangla open-domain question answering dataset.
  • Shafayat, S., et al. (2024). BEnQA: A benchmark for Bangla question answering and reasoning.
  • Taori, R., Gulrajani, I., et al. (2023). Alpaca: A replicable instruction-following model.
  • Gemma Team, et al. (2024). Gemma 2: Improving open language models at a practical size.
  • Wang, Y., et al. (2023). Self-instruct: Aligning language models with self-generated instructions.
  • Wang, Y., et al. (2024). MMLU-Pro: A robust multi-task language understanding benchmark.
  • Yue, X., et al. (2024). Pangea: A fully open multilingual multimodal LLM for 39 languages.
  • Zehady, A. K., et al. (2024). BongLLama: Llama for Bangla language.
  • Touvron, H., Lavril, T., et al. (2023). LLaMA: Open and efficient foundation language models.

Appendix A: Bangla-Instruct Curation

A.1 Volunteer Information

Seed tasks were created by 50 volunteers from various Bangladeshi universities:

  • 15 from Computer Science and Engineering
  • 10 from Bengali Literature
  • 10 from Business Administration
  • 8 from Science and Engineering
  • 7 from Social Sciences
Each volunteer contributed 10 diverse instructions, resulting in 500 seed tasks.

A.2 The Seed Dataset

The seed dataset covers 10 categories:

  1. Cultural Knowledge and Heritage
  2. Academic Writing
  3. Mathematical Problem Solving
  4. Programming and Technical
  5. Creative Writing
  6. Scientific Explanation
  7. Business and Economics
  8. Social Issues Analysis
  9. Data Analysis and Statistics
  10. Language and Translation
Each category is represented with approximately 50 tasks.

A.3 Filtering Methodology

Filtering is based on:

  • Language Adherence: High Bengali word ratio, Unicode consistency, and grammar score β‰₯ 0.8.
  • Cultural Sensitivity: Ensuring religious neutrality, regional inclusivity, gender balance, and political neutrality.
  • Content Quality: Minimum length, coherence between instruction and response, factual accuracy, and proper formatting.
  • Novelty Verification: Ensuring low similarity with existing tasks and sufficient lexical diversity.
A pair (i, r) is accepted only if all of these criteria are met, as in the simplified check sketched below.
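For illustration only, here is a simplified accept/reject check in Python; the thresholds are assumptions, and the grammar, factuality, and cultural-sensitivity scoring used in the actual pipeline (which relies on model- or annotator-based judgments) is deliberately left out.

```python
import difflib
import re

BENGALI_CHAR = re.compile(r"[\u0980-\u09FF]")   # Bengali Unicode block

def bengali_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Bengali."""
    chars = [c for c in text if not c.isspace()]
    return sum(bool(BENGALI_CHAR.match(c)) for c in chars) / max(len(chars), 1)

def max_similarity(text: str, accepted: list[str]) -> float:
    """Highest character-level similarity to any already-accepted instruction."""
    return max((difflib.SequenceMatcher(None, text, prev).ratio() for prev in accepted),
               default=0.0)

def accept_pair(instruction: str, response: str, accepted: list[str]) -> bool:
    # Grammar scoring, factual accuracy, and cultural-sensitivity checks are
    # model- or annotator-based in the pipeline and omitted from this sketch.
    return (
        bengali_ratio(instruction) >= 0.8                 # language adherence (threshold assumed)
        and bengali_ratio(response) >= 0.8
        and len(response.split()) >= 20                   # minimum length (value assumed)
        and max_similarity(instruction, accepted) < 0.7   # novelty (threshold assumed)
    )
```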


Appendix B: Experimentation Details

B.1 Experimental Setup

Pretraining was conducted on a Lambda Labs cluster with 8 NVIDIA A100 GPUs (40GB each), 512GB RAM, and 2TB storage (~120 hours with gradient checkpointing). Finetuning was performed on a single NVIDIA A100 GPU via Google Colab (~96 hours).

B.2 Pretraining Hyperparameters (Table 3)

| Hyperparameter | Value |
|---|---|
| Per-device train batch size | 64 |
| Gradient accumulation steps | 16 |
| Number of training epochs | 4 |
| Learning rate | 5×10⁻⁶ |
| FP16 | False |
| BF16 | True |
| Dataloader num workers | 8 |
| Gradient checkpointing | True |
| Logging steps | 1000 |
| DDP find unused parameters | False |
| Max gradient norm | 1.0 |
| Warmup steps | 1000 |
| Evaluation strategy | steps |
| Evaluation steps | 1000 |
| Save strategy | steps |
| Save steps | 1000 |
| Save total limit | 3 |
| Load best model at end | True |
| Metric for best model | loss |
| Greater is better | False |
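These settings map almost one-to-one onto Hugging Face `TrainingArguments`; the sketch below shows that mapping, with only the output directory name being an assumption.

```python
from transformers import TrainingArguments

pretrain_args = TrainingArguments(
    output_dir="tigerllm-pt",              # assumed
    per_device_train_batch_size=64,
    gradient_accumulation_steps=16,
    num_train_epochs=4,
    learning_rate=5e-6,
    fp16=False,
    bf16=True,
    dataloader_num_workers=8,
    gradient_checkpointing=True,
    logging_steps=1000,
    ddp_find_unused_parameters=False,
    max_grad_norm=1.0,
    warmup_steps=1000,
    eval_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
)
```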

B.3 Finetuning Hyperparameters

Finetuning settings for TigerLLM (1B) and (9B) are detailed in Tables 4 and 5.

| Parameter | TigerLLM (1B) |
|---|---|
| Max sequence length | 2048 |
| Batch size (train/eval) | 16 |
| Gradient accumulation steps | 4 |
| Number of epochs | 3 |
| Learning rate | 1e-5 |
| Weight decay | 0.02 |
| Warmup steps | 10% |
| Optimizer | AdamW (8-bit) |
| LR scheduler | Cosine |
| Precision | BF16 |
| Evaluation steps | 50 |
| Seed | 42 |

| Parameter | TigerLLM (9B) |
|---|---|
| Max sequence length | 2048 |
| Batch size (train/eval) | 32 |
| Gradient accumulation steps | 8 |
| Number of epochs | 3 |
| Learning rate | 1e-6 |
| Weight decay | 0.04 |
| Warmup steps | 15% |
| Optimizer | AdamW (8-bit) |
| LR scheduler | Cosine |
| Precision | BF16 |
| Evaluation steps | 250 |
| Seed | 42 |
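As with pretraining, the Table 4 settings for TigerLLM (1B) translate into `TrainingArguments` roughly as below. The output directory is an assumption, the 8-bit AdamW optimizer is selected through the bitsandbytes backend (`adamw_bnb_8bit`), and the 2048-token limit is applied at tokenisation time rather than as a `TrainingArguments` field.

```python
from transformers import TrainingArguments

finetune_args_1b = TrainingArguments(
    output_dir="tigerllm-1b-it",           # assumed
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=1e-5,
    weight_decay=0.02,
    warmup_ratio=0.10,                     # "10% warmup steps"
    optim="adamw_bnb_8bit",                # 8-bit AdamW
    lr_scheduler_type="cosine",
    bf16=True,
    eval_strategy="steps",
    eval_steps=50,
    seed=42,
)
# The 2048-token maximum sequence length is enforced when tokenising
# Bangla-Instruct examples, e.g. tokenizer(..., truncation=True, max_length=2048).
```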

Appendix C: TigerLLM - Training Pipeline

Figure 2 illustrates the multi-stage training pipeline for producing both TigerLLM (1B) and TigerLLM (9B). The process begins with pre-trained models (LLaMA 3.2 and Gemma-2), followed by continual pretraining on the Bangla-TextBook corpus and subsequent finetuning on the Bangla-Instruct dataset. Figures 3 and 4 depict the loss curves during the pretraining and finetuning stages respectively.
