
DistilBERT Multilingual Language Identification Model for Chatbots

This repository contains a distilbert-base-multilingual-cased transformer model fine-tuned for Language Identification (LangID) on the WiLI-2018 dataset. The model is optimized for use in multilingual chatbots and identifies the language of user input across the 235 languages covered by WiLI-2018.

📌 Use Case

The model is designed for multilingual chatbot applications, where automatic detection of user language is required to:

  • Route queries to appropriate NLP modules (see the routing sketch below)
  • Personalize responses in the user's native language
  • Enable intelligent fallback to translation pipelines
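
As a rough illustration of this routing pattern, the sketch below wires a detect_language helper (backed by the classifier in this repository) into a chatbot dispatcher. handle_user_message, lang_modules, and translate_to_english are hypothetical placeholder names, not part of this repo.

def handle_user_message(text, detect_language, lang_modules, translate_to_english):
    """Hypothetical dispatcher: detect the language, route, or fall back to translation."""
    lang_code = detect_language(text)            # e.g. "hin", "fra", "deu" from the model
    module = lang_modules.get(lang_code)
    if module is not None:
        return module.respond(text)              # route to the language-specific NLP module
    # No module for this language: translate and fall back to the default (English) module
    return lang_modules["eng"].respond(translate_to_english(text, source_lang=lang_code))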

🧠 Model Details

  • Model Architecture: DistilBERT Multilingual (distilbert-base-multilingual-cased)
  • Task: Language Identification
  • Dataset: WiLI-2018 (Wikipedia Language Identification, 235 languages)
  • Fine-tuning Framework: Hugging Face Transformers
  • Input Format: A single sentence or user utterance
  • Output: ISO 639-3 language code

📦 Usage

🔧 Installation

pip install transformers datasets torch

🚀 Inference Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_path = "path_to_your_fine_tuned_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

# Sample inputs
sentences = [
    "This is an English sentence.",
    "ΰ€―ΰ€Ή ΰ€ΰ€• ΰ€Ήΰ€Ώΰ€‚ΰ€¦ΰ₯€ ΰ€΅ΰ€Ύΰ€•ΰ₯ΰ€― ΰ€Ήΰ₯ˆΰ₯€",
    "Ceci est une phrase en franΓ§ais.",
    "Das ist ein deutscher Beispielsatz."
]

# Predict
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
    preds = torch.argmax(outputs.logits, dim=1)

# Map label ids to ISO 639-3 codes. If the mapping was stored in the model config
# during fine-tuning, it is available as model.config.id2label; otherwise supply
# the complete mapping from training, e.g. {0: "eng", 1: "hin", 2: "fra", 3: "deu", ...}
label2lang = model.config.id2label

for s, p in zip(sentences, preds):
    print(f"'{s}' ➜ Predicted Language Code: {label2lang[p.item()]}")
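
If a confidence score is useful (for example, to trigger a translation fallback only on uncertain detections), a softmax over the logits gives per-language probabilities. This minimal sketch reuses the model, inputs, and label2lang mapping from the example above.

import torch.nn.functional as F

# Per-class probabilities and the top prediction with its confidence
probs = F.softmax(outputs.logits, dim=-1)
confidences, preds = probs.max(dim=-1)

for s, p, c in zip(sentences, preds, confidences):
    print(f"'{s}' ➜ {label2lang[p.item()]} (confidence {c.item():.3f})")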

📊 Performance Metrics

  • Accuracy: 0.965413
  • F1 Score: 0.965528
  • Precision: 0.966185
  • Recall: 0.965413
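
For reference, a minimal sketch of how such a metric suite is typically computed with scikit-learn on the held-out test split; weighted averaging is an assumption here (consistent with the accuracy and recall values above being identical), not something stated in this card.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Accuracy plus weighted-average precision/recall/F1 over all classes."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }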

πŸ—ƒοΈ Dataset: WiLI-2018

  • Source: Wikipedia articles
  • Languages: 235 ISO 639-3 codes
  • Samples per language: ~1,000
  • Text Type: Encyclopedic, single-paragraph entries
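
To inspect or reproduce the data, the corpus can typically be loaded through the datasets library. The snippet below assumes the wili_2018 dataset (with 'sentence' and 'label' columns) is still available on the Hugging Face Hub.

from datasets import load_dataset

dataset = load_dataset("wili_2018")                       # train/test splits
print(dataset["train"][0])                                # {'sentence': ..., 'label': ...}
label_names = dataset["train"].features["label"].names    # ISO 639-3-style label names
print(len(label_names))                                   # 235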

βš™οΈ Fine-Tuning Configuration

  • Epochs: 3
  • Batch size: 16
  • Learning rate: 2e-5
  • Max sequence length: 128
  • Evaluation strategy: per epoch
  • Loss function: CrossEntropy
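
A minimal fine-tuning sketch showing how the configuration above maps onto the Hugging Face Trainer API (dataset loading as in the snippet above; exact argument names may vary slightly across transformers versions).

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

base = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=235)

dataset = load_dataset("wili_2018").map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="langid-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

# AutoModelForSequenceClassification applies cross-entropy loss by default,
# so no custom loss function is needed.
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  tokenizer=tokenizer)
# trainer.train()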

🔄 Quantization (Optional)

Post-training quantization can be applied using:

import torch
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
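
Note that quantize_dynamic returns a new model rather than modifying the original in place. A quick sanity check of the effect, assuming the model and tokenized inputs from the inference example above:

import os
import torch

# Compare on-disk size of the FP32 and dynamically quantized INT8 variants
torch.save(model.state_dict(), "model_fp32.pt")
torch.save(quantized_model.state_dict(), "model_int8.pt")
print(os.path.getsize("model_fp32.pt") / 1e6, "MB vs",
      os.path.getsize("model_int8.pt") / 1e6, "MB")

# The quantized model is a drop-in replacement for CPU inference
with torch.no_grad():
    preds = quantized_model(**inputs).logits.argmax(dim=-1)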

πŸ“ Repository Structure

.
├── config.json
├── tokenizer_config.json
├── special_tokens_map.json
├── tokenizer.json
├── model.bin / model.safetensors
├── README.md

🚫 Limitations

  • May have lower accuracy for extremely low-resource languages
  • Performance may degrade on noisy, code-switched, or informal text
  • Assumes input is a single-language sentence

🤝 Contributing

Contributions and improvements are welcome! Please open an issue or pull request if you'd like to help enhance this model or its documentation.
