IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data
Overview
IndicPhi-mini is a fine-tuned version of Microsoft’s Phi-mini-MoE, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques such as QLoRA-based quantization and LoRA adapters, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent 3–4% accuracy improvements across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data. a compact Mixture-of-Experts (MoE) model
Key Contributions
- Curated one of the largest Indic corpora to date: 561M samples → cleaned into 29M high-quality samples across 13 Indic languages.
- Fine-tuned Phi-mini-MoE (7.6B params, 2.4B active) using QLoRA (4-bit) and LoRA adapters, making training feasible on a single A100-80GB GPU.
- Achieved +3–4% accuracy improvements on major Indic benchmarks:
- ARC-Challenge-Indic (reasoning tasks)
- MMLU-Indic (knowledge & domain understanding)
- Improved generalization across multiple Indic languages including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.
Model Architecture
- Base model: Phi-mini-MoE-Instruct (Microsoft)
- Parameters: 7.6B total (2.4B active per token)
- Layers: 32 decoder-only transformer blocks
- Attention: Grouped Query Attention (GQA)
- Experts per layer: 16 (Top-2 active per token)
- Context length: 4096 tokens
Usage
To load the fine-tuned model:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "SandLogicTechnologies/IndicPhi-mini"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
load_in_4bit=True
)
prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Dataset Preparation
Data Sources
- Total collected: 561M samples from 53 datasets from Hugging Face.
- Languages covered: 13 Indian languages which include Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali,Odia, Punjabi, Assamese, Sinhala, Urdu.
- Categories: General text, translation, instruction, conversational.
Processing Pipeline
- Manual Filtering – removed noisy, irrelevant, and malformed samples.
- Preprocessing – deduplication, language identification, normalization, minimum length filtering.
- Format Conversion – standardized into UltraChat JSON schema (multi-turn conversations).
Final Cleaned Dataset
- Size: 29M samples
Dataset Distribution (Final Cleaned)
Language | Samples |
---|---|
Hindi | 4.63M |
Kannada | 3.54M |
Telugu | 3.72M |
Tamil | 3.86M |
Marathi | 3.79M |
Malayalam | 2.81M |
Gujarati | 2.94M |
Bengali | 1.82M |
Odia | 438K |
Punjabi | 1.21M |
Assamese | 185K |
Sinhala | 64K |
Urdu | 58K |
Total curated dataset: ~29 million high-quality samples
Training Details
- Hardware: 1 × NVIDIA A100-80GB
- Precision: QLoRA (4-bit quantization)
- Batching: Effective batch size 256 (32 × 8 gradient accumulation)
- Steps: 8,500
- Optimizer: AdamW (8-bit) + cosine LR schedule + 1k warmup steps
- LoRA configuration:
- Layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- r=128, α=128, dropout=0
- Final training loss: 0.48
Evaluation & Results
Benchmarks
- ARC-Challenge-Indic (reasoning)
- MMLU-Indic (knowledge & domain understanding)
Improvements
- ARC-Challenge-Indic
- Accuracy: 21.03 → 24.46 (+3.43%)
- Normalized Accuracy: 24.69 → 28.86 (+4.17%)
- MMLU-Indic
- Accuracy: 27.47 → 30.95 (+3.48%)
Results
ARC-Challenge-Indic
Language | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
---|---|---|
Hindi | 22.61 | 26.17 |
Kannada | 20.96 | 25.83 |
Tamil | 20.78 | 24.61 |
Telugu | 20.70 | 26.00 |
Bengali | 21.91 | 25.04 |
Gujarati | 18.17 | 21.30 |
Malayalam | 22.26 | 23.91 |
Marathi | 19.65 | 25.22 |
Odia | 22.26 | 24.17 |
Accuracy: (Phi-mini-MoE) 21.03 → (IndicPhi-mini) 24.46 (+3.43%)
MMLU-Indic
Language | Accuracy (Phi-mini-MoE) | Accuracy (Phi-mini-MoE) |
---|---|---|
Hindi | 28.01 | 31.45 |
Kannada | 26.74 | 30.12 |
Tamil | 27.53 | 30.84 |
Telugu | 27.20 | 31.02 |
Bengali | 28.36 | 31.44 |
Gujarati | 25.91 | 29.28 |
Malayalam | 26.65 | 29.77 |
Marathi | 27.12 | 30.63 |
Odia | 27.05 | 30.45 |
Punjabi | 26.42 | 29.61 |
Assamese | 25.98 | 29.23 |
Sinhala | 24.87 | 27.66 |
Urdu | 25.44 | 28.71 |
Accuracy: (Phi-mini-MoE) 27.47 → (IndicPhi-mini) 30.95 (+3.48%)
Acknowledgments
The Phi-mini-MoE-Instruct models are based on the original work by Microsoft and fine-tuned by the Sandlogic development team.
Special thanks to:
- The Microsoft team for developing and releasing the microsoft/Phi-mini-MoE-instruct model.
- The authors and organizations behind the 53 open-source datasets that made this work possible.
The complete list of dataset sources and citations is available here.
Contact
For any inquiries or support, please contact us at [email protected] or visit our Website.
- Downloads last month
- 6
Model tree for SandLogicTechnologies/IndicPhi-mini
Base model
microsoft/Phi-mini-MoE-instruct