IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data

Overview

IndicPhi-mini is a fine-tuned version of Microsoft’s Phi-mini-MoE, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques such as QLoRA-based quantization and LoRA adapters, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent 3–4% accuracy improvements across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data. a compact Mixture-of-Experts (MoE) model

Key Contributions

Curated one of the largest Indic corpora to date: 561M samples → cleaned into 29M high-quality samples across 13 Indic languages.
Fine-tuned Phi-mini-MoE (7.6B params, 2.4B active) using QLoRA (4-bit) and LoRA adapters, making training feasible on a single A100-80GB GPU.
Achieved +3–4% accuracy improvements on major Indic benchmarks:
- ARC-Challenge-Indic (reasoning tasks)
- MMLU-Indic (knowledge & domain understanding)
Improved generalization across multiple Indic languages including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.

Model Architecture

Base model: Phi-mini-MoE-Instruct (Microsoft)
Parameters: 7.6B total (2.4B active per token)
Layers: 32 decoder-only transformer blocks
Attention: Grouped Query Attention (GQA)
Experts per layer: 16 (Top-2 active per token)
Context length: 4096 tokens

Usage

To load the fine-tuned model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SandLogicTechnologies/IndicPhi-mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_4bit=True
)

prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"  

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Dataset Preparation

Data Sources

Total collected: 561M samples from 53 datasets from Hugging Face.
Languages covered: 13 Indian languages which include Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali,Odia, Punjabi, Assamese, Sinhala, Urdu.
Categories: General text, translation, instruction, conversational.

Processing Pipeline

Manual Filtering – removed noisy, irrelevant, and malformed samples.
Preprocessing – deduplication, language identification, normalization, minimum length filtering.
Format Conversion – standardized into UltraChat JSON schema (multi-turn conversations).

Final Cleaned Dataset

Size: 29M samples

Dataset Distribution (Final Cleaned)

Language	Samples
Hindi	4.63M
Kannada	3.54M
Telugu	3.72M
Tamil	3.86M
Marathi	3.79M
Malayalam	2.81M
Gujarati	2.94M
Bengali	1.82M
Odia	438K
Punjabi	1.21M
Assamese	185K
Sinhala	64K
Urdu	58K

Total curated dataset: ~29 million high-quality samples

Training Details

Hardware: 1 × NVIDIA A100-80GB
Precision: QLoRA (4-bit quantization)
Batching: Effective batch size 256 (32 × 8 gradient accumulation)
Steps: 8,500
Optimizer: AdamW (8-bit) + cosine LR schedule + 1k warmup steps
LoRA configuration:
- Layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- r=128, α=128, dropout=0
Final training loss: 0.48

Evaluation & Results

Benchmarks

ARC-Challenge-Indic (reasoning)
MMLU-Indic (knowledge & domain understanding)

Improvements

ARC-Challenge-Indic
- Accuracy: 21.03 → 24.46 (+3.43%)
- Normalized Accuracy: 24.69 → 28.86 (+4.17%)
MMLU-Indic
- Accuracy: 27.47 → 30.95 (+3.48%)

Results

ARC-Challenge-Indic

Language	Accuracy (Phi-mini-MoE)	Accuracy (IndicPhi-mini)
Hindi	22.61	26.17
Kannada	20.96	25.83
Tamil	20.78	24.61
Telugu	20.70	26.00
Bengali	21.91	25.04
Gujarati	18.17	21.30
Malayalam	22.26	23.91
Marathi	19.65	25.22
Odia	22.26	24.17

Accuracy: (Phi-mini-MoE) 21.03 → (IndicPhi-mini) 24.46 (+3.43%)

MMLU-Indic

Language	Accuracy (Phi-mini-MoE)	Accuracy (Phi-mini-MoE)
Hindi	28.01	31.45
Kannada	26.74	30.12
Tamil	27.53	30.84
Telugu	27.20	31.02
Bengali	28.36	31.44
Gujarati	25.91	29.28
Malayalam	26.65	29.77
Marathi	27.12	30.63
Odia	27.05	30.45
Punjabi	26.42	29.61
Assamese	25.98	29.23
Sinhala	24.87	27.66
Urdu	25.44	28.71

Accuracy: (Phi-mini-MoE) 27.47 → (IndicPhi-mini) 30.95 (+3.48%)

Acknowledgments

The Phi-mini-MoE-Instruct models are based on the original work by Microsoft and fine-tuned by the Sandlogic development team.

Special thanks to:

The Microsoft team for developing and releasing the microsoft/Phi-mini-MoE-instruct model.
The authors and organizations behind the 53 open-source datasets that made this work possible.
The complete list of dataset sources and citations is available here.

Contact

For any inquiries or support, please contact us at [email protected] or visit our Website.

Downloads last month: 6

Safetensors

Model size

7.65B params

Tensor type

BF16

Model tree for SandLogicTechnologies/IndicPhi-mini

Base model

microsoft/Phi-mini-MoE-instruct

Finetuned

(1)

this model

SandLogicTechnologies
/

IndicPhi-mini