# DistilBERT Multilingual Language Identification Model for Chatbots

This repository contains a fine-tuned `distilbert-base-multilingual-cased` transformer model for Language Identification (LangID) on the WiLI-2018 dataset. The model is optimized for use in multilingual chatbots and can identify the language of user input across 200+ languages.
## Use Case
The model is designed for multilingual chatbot applications, where automatic detection of user language is required to:
- Route queries to appropriate NLP modules
- Personalize responses in the user's native language
- Enable intelligent fallback to translation pipelines (see the routing sketch below)
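For illustration, a minimal routing sketch. The `detect_language` callable and the module names are hypothetical placeholders, not part of this repository; `detect_language` is assumed to wrap the inference example from the Usage section and return an ISO 639-3 code.

```python
# Hypothetical routing sketch: dispatch a user message based on the detected language.
SUPPORTED = {"eng", "hin", "fra", "deu"}  # languages with dedicated NLP modules (example)

def route(text: str, detect_language) -> str:
    lang = detect_language(text)
    if lang in SUPPORTED:
        return f"nlp_module:{lang}"   # hand off to the language-specific module
    return "fallback:translation"     # route through a translation pipeline first
```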
## Model Details
- Model Architecture: DistilBERT Multilingual (`distilbert-base-multilingual-cased`)
- Task: Language Identification
- Dataset: WiLI-2018 (Wikipedia Language Identification, 235 languages)
- Fine-tuning Framework: Hugging Face Transformers
- Input Format: A single sentence or user utterance
- Output: ISO 639-3 language code
## Usage
### Installation

```bash
pip install transformers datasets torch
```
### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_path = "path_to_your_fine_tuned_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

# Sample inputs
sentences = [
    "This is an English sentence.",
    "यह एक हिंदी वाक्य है।",
    "Ceci est une phrase en français.",
    "Das ist ein deutscher Beispielsatz."
]

# Predict
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
preds = torch.argmax(outputs.logits, dim=1)

# Map label ids to ISO 639-3 codes (example; complete the mapping from training)
label2lang = {0: "eng", 1: "hin", 2: "fra", 3: "deu"}
for s, p in zip(sentences, preds):
    print(f"'{s}' → Predicted Language Code: {label2lang[p.item()]}")
```
## Performance Metrics
- Accuracy: 0.965413
- F1 Score: 0.965528
- Precision: 0.966185
- Recall: 0.965413
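The averaging scheme is not stated in the card; the sketch below assumes weighted averaging (under which recall equals accuracy, consistent with the numbers above) and uses placeholder label ids in place of a real evaluation run.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder label ids; substitute the true and predicted ids from your evaluation run.
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 3, 0, 2]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```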
## Dataset: WiLI-2018
- Source: Wikipedia articles
- Languages: 235 ISO 639-3 codes
- Samples per language: ~2,000
- Text Type: Encyclopedic, single-paragraph entries
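A sketch of loading the dataset with Hugging Face Datasets. The Hub id `wili_2018` and the `sentence`/`label` column names are assumptions and may need adjusting:

```python
from datasets import load_dataset

# Assumed Hub id and column names; adjust to match your copy of WiLI-2018.
wili = load_dataset("wili_2018")
print(wili)  # expected: train/test splits
example = wili["train"][0]
print(example["sentence"][:80], example["label"])
```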
## Fine-Tuning Configuration
- Epochs: 3
- Batch size: 16
- Learning rate: 2e-5
- Max sequence length: 128
- Evaluation strategy: per epoch
- Loss function: CrossEntropy
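A minimal fine-tuning sketch using these hyperparameters. The dataset id, column names, and overall script are assumptions; this is not the original training code.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed Hub id and column names for WiLI-2018.
wili = load_dataset("wili_2018")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=235  # 235 WiLI-2018 classes
)

def tokenize(batch):
    # Max sequence length of 128, as listed above.
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

train_ds = wili["train"].map(tokenize, batched=True)
eval_ds = wili["test"].map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="langid-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    eval_strategy="epoch",  # named `evaluation_strategy` on older transformers versions
)

# Cross-entropy loss is the default for sequence classification heads.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```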
## Quantization (Optional)
Post-training dynamic quantization can be applied with PyTorch:

```python
import torch

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
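The quantized model is a drop-in replacement for `model` in the inference example above. A rough way to compare serialized sizes (a sketch, not part of the original card):

```python
import io
import torch

def serialized_size_mb(m: torch.nn.Module) -> float:
    # Serialize the state dict to an in-memory buffer and report its size.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {serialized_size_mb(model):.1f} MB")
print(f"int8: {serialized_size_mb(quantized_model):.1f} MB")
```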
## Repository Structure

```
.
├── config.json
├── tokenizer_config.json
├── special_tokens_map.json
├── tokenizer.json
├── model.bin / model.safetensors
└── README.md
```
## Limitations
- May have lower accuracy for extremely low-resource languages
- Performance may degrade on noisy, code-switched, or informal text
- Assumes input is a single-language sentence
## Contributing
Contributions and improvements are welcome! Please open an issue or pull request if you'd like to help enhance this model or its documentation.