
DistilBERT Multilingual Language Identification Model for Chatbots

This repository contains a distilbert-base-multilingual-cased transformer model fine-tuned for Language Identification (LangID) on the WiLI-2018 dataset. The model is optimized for use in multilingual chatbots and identifies the language of user input across the 235 languages covered by WiLI-2018.

📌 Use Case

The model is designed for multilingual chatbot applications, where automatic detection of user language is required to:

  • Route queries to appropriate NLP modules (see the routing sketch below)
  • Personalize responses in the user's native language
  • Enable intelligent fallback to translation pipelines
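
As a rough illustration of this routing pattern, the sketch below wires a detect_language helper (backed by the classifier in this repository) into a chatbot dispatcher. handle_user_message, lang_modules, and translate_to_english are hypothetical placeholder names, not part of this repo.

def handle_user_message(text, detect_language, lang_modules, translate_to_english):
    """Hypothetical dispatcher: detect the language, route, or fall back to translation."""
    lang_code = detect_language(text)            # e.g. "hin", "fra", "deu" from the model
    module = lang_modules.get(lang_code)
    if module is not None:
        return module.respond(text)              # route to the language-specific NLP module
    # No module for this language: translate and fall back to the default (English) module
    return lang_modules["eng"].respond(translate_to_english(text, source_lang=lang_code))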

🧠 Model Details

  • Model Architecture: DistilBERT Multilingual (distilbert-base-multilingual-cased)
  • Task: Language Identification
  • Dataset: WiLI-2018 (Wikipedia Language Identification, 235 languages)
  • Fine-tuning Framework: Hugging Face Transformers
  • Input Format: A single sentence or user utterance
  • Output: ISO 639-3 language code

📦 Usage

🔧 Installation

pip install transformers datasets torch

🚀 Inference Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_path = "path_to_your_fine_tuned_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

# Sample inputs
sentences = [
    "This is an English sentence.",
    "ΰ€―ΰ€Ή ΰ€ΰ€• ΰ€Ήΰ€Ώΰ€‚ΰ€¦ΰ₯€ ΰ€΅ΰ€Ύΰ€•ΰ₯ΰ€― ΰ€Ήΰ₯ˆΰ₯€",
    "Ceci est une phrase en franΓ§ais.",
    "Das ist ein deutscher Beispielsatz."
]

# Predict
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
    preds = torch.argmax(outputs.logits, dim=1)

# Map label ids to ISO 639-3 codes. If the mapping was stored in the model config
# during fine-tuning, it is available as model.config.id2label; otherwise supply
# the complete mapping from training, e.g. {0: "eng", 1: "hin", 2: "fra", 3: "deu", ...}
label2lang = model.config.id2label

for s, p in zip(sentences, preds):
    print(f"'{s}' ➜ Predicted Language Code: {label2lang[p.item()]}")
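
If a confidence score is useful (for example, to trigger a translation fallback only on uncertain detections), a softmax over the logits gives per-language probabilities. This minimal sketch reuses the model, inputs, and label2lang mapping from the example above.

import torch.nn.functional as F

# Per-class probabilities and the top prediction with its confidence
probs = F.softmax(outputs.logits, dim=-1)
confidences, preds = probs.max(dim=-1)

for s, p, c in zip(sentences, preds, confidences):
    print(f"'{s}' ➜ {label2lang[p.item()]} (confidence {c.item():.3f})")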

📊 Performance Metrics

  • Accuracy: 0.965413
  • F1 Score: 0.965528
  • Precision: 0.966185
  • Recall: 0.965413
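
For reference, a minimal sketch of how such a metric suite is typically computed with scikit-learn on the held-out test split; weighted averaging is an assumption here (consistent with the accuracy and recall values above being identical), not something stated in this card.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Accuracy plus weighted-average precision/recall/F1 over all classes."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }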

πŸ—ƒοΈ Dataset: WiLI-2018

  • Source: Wikipedia articles
  • Languages: 235 ISO 639-3 codes
  • Samples per language: ~1,000
  • Text Type: Encyclopedic, single-paragraph entries
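
To inspect or reproduce the data, the corpus can typically be loaded through the datasets library. The snippet below assumes the wili_2018 dataset (with 'sentence' and 'label' columns) is still available on the Hugging Face Hub.

from datasets import load_dataset

dataset = load_dataset("wili_2018")                       # train/test splits
print(dataset["train"][0])                                # {'sentence': ..., 'label': ...}
label_names = dataset["train"].features["label"].names    # ISO 639-3-style label names
print(len(label_names))                                   # 235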

βš™οΈ Fine-Tuning Configuration

  • Epochs: 3
  • Batch size: 16
  • Learning rate: 2e-5
  • Max sequence length: 128
  • Evaluation strategy: per epoch
  • Loss function: CrossEntropy
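
A minimal fine-tuning sketch showing how the configuration above maps onto the Hugging Face Trainer API (dataset loading as in the snippet above; exact argument names may vary slightly across transformers versions).

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

base = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=235)

dataset = load_dataset("wili_2018").map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="langid-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

# AutoModelForSequenceClassification applies cross-entropy loss by default,
# so no custom loss function is needed.
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  tokenizer=tokenizer)
# trainer.train()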

🔄 Quantization (Optional)

Post-training quantization can be applied using:

import torch
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
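
Note that quantize_dynamic returns a new model rather than modifying the original in place. A quick sanity check of the effect, assuming the model and tokenized inputs from the inference example above:

import os
import torch

# Compare on-disk size of the FP32 and dynamically quantized INT8 variants
torch.save(model.state_dict(), "model_fp32.pt")
torch.save(quantized_model.state_dict(), "model_int8.pt")
print(os.path.getsize("model_fp32.pt") / 1e6, "MB vs",
      os.path.getsize("model_int8.pt") / 1e6, "MB")

# The quantized model is a drop-in replacement for CPU inference
with torch.no_grad():
    preds = quantized_model(**inputs).logits.argmax(dim=-1)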

πŸ“ Repository Structure

.
├── config.json
├── tokenizer_config.json
├── special_tokens_map.json
├── tokenizer.json
├── model.bin / model.safetensors
├── README.md

🚫 Limitations

  • May have lower accuracy for extremely low-resource languages
  • Performance may degrade on noisy, code-switched, or informal text
  • Assumes input is a single-language sentence

🤝 Contributing

Contributions and improvements are welcome! Please open an issue or pull request if you'd like to help enhance this model or its documentation.
