Arabic Message Classification Model
Model Description
This is a fine-tuned XLM-RoBERTa model for Arabic message classification, designed to classify messages in both Modern Standard Arabic (MSA) and Iraqi dialect. The model is based on morit/arabic_xlm_xnli and has been fine-tuned on a custom dataset of 5,000 Arabic messages.
Model Details
- Base Model: morit/arabic_xlm_xnli
- Architecture: XLMRobertaForSequenceClassification
- Language: Arabic (MSA and Iraqi dialect)
- Task: Text Classification
- Number of Labels: 4
- Model Size: ~280M parameters
Labels
The model classifies messages into four categories:
Label ID | Label Name | Description | Examples |
---|---|---|---|
0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |
Training Data
The model was trained on a custom dataset containing:
- 5,000 Arabic messages (50% MSA, 50% Iraqi dialect)
- Balanced distribution: 1,250 examples per class
- Train/Test Split: 90%/10%
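The card does not say how the 90%/10% split was drawn. As a minimal sketch, assuming a stratified split that preserves the class balance, the sizes work out to 4,500 training and 500 test messages (125 test messages per class). The `stratified_split` helper and the placeholder data are hypothetical, not the author's code.

```python
import random
from collections import defaultdict


def stratified_split(examples, test_fraction=0.1, seed=42):
    """Split (text, label) pairs so every label keeps the same test share.

    Hypothetical helper: the model card does not describe the actual
    splitting procedure.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Placeholder data with the card's class balance: 4 labels x 1,250 messages.
examples = [(f"message {i}", i % 4) for i in range(5000)]
train, test = stratified_split(examples)
print(len(train), len(test))  # 4500 500
```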
Training Details
- Training Epochs: 20
- Batch Size: 8 (training), 16 (evaluation)
- Optimizer: AdamW with the default learning rate
- Maximum Sequence Length: 128 tokens
- Evaluation Strategy: Every 500 steps
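To put these settings in perspective, the arithmetic below estimates how many optimizer steps and evaluations the run involves. It is a rough sketch that assumes a single device, no gradient accumulation, a final partial batch that is kept, and a training set of 4,500 messages (90% of 5,000); none of these details are confirmed by the card.

```python
import math

train_examples = 4500   # assumed: 90% of the 5,000 messages
train_batch_size = 8
epochs = 20
eval_every = 500        # evaluation interval in steps

# One optimizer step per batch; the last batch of an epoch may be partial.
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_steps = steps_per_epoch * epochs
num_evaluations = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evaluations)  # 563 11260 22
```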
Usage
Using Transformers Pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
)
# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
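The card does not state whether the hosted configuration stores the label names. If it does not, the pipeline returns generic strings such as `LABEL_2` instead of `complaint`. The sketch below covers both cases; `normalize_label` is a hypothetical helper, and the mapping is copied from the Labels table.

```python
# Mapping taken from the Labels table in this card.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}


def normalize_label(raw_label):
    """Turn a generic 'LABEL_<id>' string into a readable label name.

    Labels that are already readable are returned unchanged.
    """
    if raw_label.startswith("LABEL_"):
        return ID2LABEL[int(raw_label.split("_", 1)[1])]
    return raw_label


print(normalize_label("LABEL_2"))   # complaint
print(normalize_label("greeting"))  # greeting
```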
Using the Model Directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    # Convert raw logits into class probabilities
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax().item()
    confidence = predictions.max().item()
# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]
print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
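The confidence printed above is the largest softmax probability over the four logits. The dependency-free sketch below shows that calculation on made-up logits, so the numbers are illustrative only and are not output from this model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the four classes (greeting, question, complaint, general).
logits = [0.3, 2.9, -0.5, 0.1]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[predicted_class_id]
print(predicted_class_id, round(confidence, 4))
```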
Gradio Web Interface
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
def classify_text(text):
    # Return the top label and its confidence score for one message
    result = classifier(text)[0]
    return result["label"], float(result["score"])
# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence"),
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general.",
)
iface.launch()
Model Performance
The model reaches a self-reported accuracy of 0.950 on the held-out test set and is described as particularly effective at:
- Distinguishing between greetings and general statements
- Identifying questions in both MSA and Iraqi dialect
- Classifying complaints and technical issues
- Handling mixed dialectal variations
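The card reports a single accuracy figure and no per-class breakdown. To check the claims above on your own labeled messages, a sketch like the following computes overall accuracy and per-class recall. The `accuracy_and_recall` helper is hypothetical, and the toy labels are placeholders rather than results from this model.

```python
from collections import Counter


def accuracy_and_recall(y_true, y_pred):
    """Return overall accuracy and a {label: recall} dictionary."""
    total = Counter(y_true)
    correct = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            correct[truth] += 1
    accuracy = sum(correct.values()) / len(y_true)
    recall = {label: correct[label] / total[label] for label in total}
    return accuracy, recall


# Toy placeholder labels, not real evaluation data.
y_true = ["greeting", "question", "complaint", "general", "question", "general"]
y_pred = ["greeting", "question", "complaint", "greeting", "question", "general"]

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.3f}")
for label, value in recall.items():
    print(f"{label}: recall={value:.2f}")
```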
Supported Dialects
- Modern Standard Arabic (MSA): Formal Arabic text
- Iraqi Dialect: Colloquial Iraqi Arabic expressions and vocabulary
Limitations
- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
- Limited to 4 predefined categories
- Performance depends on the similarity of input text to training data patterns
- Maximum input length is 128 tokens
Ethical Considerations
This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
- The model may reflect biases present in the training data
- Performance may vary across different Arabic dialects not represented in training
- The model should not be used for sensitive applications without proper validation
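One common safeguard for sensitive use is to send low-confidence predictions to a human reviewer instead of acting on them automatically. The sketch below assumes the pipeline's output format shown earlier (a dict with `label` and `score`). The `route_prediction` helper, the `needs_review` marker, and the 0.8 threshold are hypothetical and should be tuned on your own validation data.

```python
def route_prediction(prediction, threshold=0.8):
    """Return the label if the score is high enough, else flag for review.

    `prediction` is one pipeline result, e.g. {"label": "complaint", "score": 0.93}.
    The threshold is an illustrative assumption, not a recommended value.
    """
    if prediction["score"] >= threshold:
        return prediction["label"]
    return "needs_review"


print(route_prediction({"label": "complaint", "score": 0.93}))  # complaint
print(route_prediction({"label": "general", "score": 0.41}))    # needs_review
```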
Citation
If you use this model in your research, please cite:
@misc{arabic-mi-classifier,
title={Arabic Message Classification Model},
author={Ahmed Majid},
year={2025},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
Model Card
For more detailed information about the model's intended use, training data, and ethical considerations, please refer to the model card.
Contact
For questions or issues, please contact the author or create an issue in the model repository.
License
This model is released under the MIT License, the same as the base model morit/arabic_xlm_xnli.
Model tree for ahmedmajid92/Arabic_MI_Classifier
- Base model: morit/arabic_xlm_xnli
Evaluation Results
- Accuracy on Arabic Messages Dataset (MSA + Iraqi): 0.950 (self-reported)