IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data

Overview

IndicPhi-mini is a fine-tuned version of Microsoft’s Phi-mini-MoE, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques such as QLoRA-based quantization and LoRA adapters, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent 3–4% accuracy improvements across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data. a compact Mixture-of-Experts (MoE) model


Key Contributions

  • Curated one of the largest Indic corpora to date: 561M samples → cleaned into 29M high-quality samples across 13 Indic languages.
  • Fine-tuned Phi-mini-MoE (7.6B params, 2.4B active) using QLoRA (4-bit) and LoRA adapters, making training feasible on a single A100-80GB GPU.
  • Achieved +3–4% accuracy improvements on major Indic benchmarks:
    • ARC-Challenge-Indic (reasoning tasks)
    • MMLU-Indic (knowledge & domain understanding)
  • Improved generalization across multiple Indic languages including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.

Model Architecture

  • Base model: Phi-mini-MoE-Instruct (Microsoft)
  • Parameters: 7.6B total (2.4B active per token)
  • Layers: 32 decoder-only transformer blocks
  • Attention: Grouped Query Attention (GQA)
  • Experts per layer: 16 (Top-2 active per token)
  • Context length: 4096 tokens

Usage

To load the fine-tuned model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SandLogicTechnologies/IndicPhi-mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_4bit=True
)

prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"  

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Dataset Preparation

Data Sources

  • Total collected: 561M samples from 53 datasets from Hugging Face.
  • Languages covered: 13 Indian languages which include Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali,Odia, Punjabi, Assamese, Sinhala, Urdu.
  • Categories: General text, translation, instruction, conversational.

Processing Pipeline

  1. Manual Filtering – removed noisy, irrelevant, and malformed samples.
  2. Preprocessing – deduplication, language identification, normalization, minimum length filtering.
  3. Format Conversion – standardized into UltraChat JSON schema (multi-turn conversations).

Final Cleaned Dataset

  • Size: 29M samples

Dataset Distribution (Final Cleaned)

Language Samples
Hindi 4.63M
Kannada 3.54M
Telugu 3.72M
Tamil 3.86M
Marathi 3.79M
Malayalam 2.81M
Gujarati 2.94M
Bengali 1.82M
Odia 438K
Punjabi 1.21M
Assamese 185K
Sinhala 64K
Urdu 58K

Total curated dataset: ~29 million high-quality samples


Training Details

  • Hardware: 1 × NVIDIA A100-80GB
  • Precision: QLoRA (4-bit quantization)
  • Batching: Effective batch size 256 (32 × 8 gradient accumulation)
  • Steps: 8,500
  • Optimizer: AdamW (8-bit) + cosine LR schedule + 1k warmup steps
  • LoRA configuration:
    • Layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • r=128, α=128, dropout=0
  • Final training loss: 0.48

Evaluation & Results

Benchmarks

  1. ARC-Challenge-Indic (reasoning)
  2. MMLU-Indic (knowledge & domain understanding)

Improvements

  • ARC-Challenge-Indic
    • Accuracy: 21.03 → 24.46 (+3.43%)
    • Normalized Accuracy: 24.69 → 28.86 (+4.17%)
  • MMLU-Indic
    • Accuracy: 27.47 → 30.95 (+3.48%)

Results

ARC-Challenge-Indic

Language Accuracy (Phi-mini-MoE) Accuracy (IndicPhi-mini)
Hindi 22.61 26.17
Kannada 20.96 25.83
Tamil 20.78 24.61
Telugu 20.70 26.00
Bengali 21.91 25.04
Gujarati 18.17 21.30
Malayalam 22.26 23.91
Marathi 19.65 25.22
Odia 22.26 24.17

Accuracy: (Phi-mini-MoE) 21.03 → (IndicPhi-mini) 24.46 (+3.43%)

MMLU-Indic

Language Accuracy (Phi-mini-MoE) Accuracy (Phi-mini-MoE)
Hindi 28.01 31.45
Kannada 26.74 30.12
Tamil 27.53 30.84
Telugu 27.20 31.02
Bengali 28.36 31.44
Gujarati 25.91 29.28
Malayalam 26.65 29.77
Marathi 27.12 30.63
Odia 27.05 30.45
Punjabi 26.42 29.61
Assamese 25.98 29.23
Sinhala 24.87 27.66
Urdu 25.44 28.71

Accuracy: (Phi-mini-MoE) 27.47 → (IndicPhi-mini) 30.95 (+3.48%)

Acknowledgments

The Phi-mini-MoE-Instruct models are based on the original work by Microsoft and fine-tuned by the Sandlogic development team.

Special thanks to:

  • The Microsoft team for developing and releasing the microsoft/Phi-mini-MoE-instruct model.
  • The authors and organizations behind the 53 open-source datasets that made this work possible.
    The complete list of dataset sources and citations is available here.

Contact

For any inquiries or support, please contact us at [email protected] or visit our Website.

Downloads last month
6
Safetensors
Model size
7.65B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SandLogicTechnologies/IndicPhi-mini

Finetuned
(1)
this model

Dataset used to train SandLogicTechnologies/IndicPhi-mini