Arabic Message Classification Model

Model Description

This model is a fine-tuned XLM-RoBERTa classifier for Arabic messages written in Modern Standard Arabic (MSA) or Iraqi dialect. It is based on morit/arabic_xlm_xnli and was fine-tuned on a custom dataset of 5,000 Arabic messages.

Model Details

Base Model: morit/arabic_xlm_xnli
Architecture: XLMRobertaForSequenceClassification
Language: Arabic (MSA and Iraqi dialect)
Task: Text Classification
Number of Labels: 4
Model Size: ~280M parameters

Labels

The model classifies messages into four categories:

Label ID	Label Name	Description	Examples
0	greeting	Greetings and salutations	"السلام عليكم", "هلو", "مرحبا"
1	question	Questions and inquiries	"كيف حالك؟", "شلونك؟", "متى الاجتماع؟"
2	complaint	Complaints and problems	"عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل"
3	general	General statements	"أحب القراءة", "أعمل مهندساً", "أسافر كثيراً"

Training Data

The model was trained on a custom dataset containing:

5,000 Arabic messages (50% MSA, 50% Iraqi dialect)
Balanced distribution: 1,250 examples per class
Train/Test Split: 90%/10%
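
A 90%/10% split that preserves the class balance can be sketched as follows. This is a minimal illustration and not the author's actual preprocessing: the card does not say whether the split was stratified, and the function name `stratified_split`, the seed, and the placeholder messages are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(examples, test_fraction=0.1, seed=42):
    """Split (text, label) pairs so each label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Placeholder data with the shape described above: 4 classes x 1,250 messages.
labels = ["greeting", "question", "complaint", "general"]
examples = [(f"message {i}", label) for label in labels for i in range(1250)]

train, test = stratified_split(examples)
print(len(train), len(test))  # 4500 500
```

Under these assumptions each class contributes 1,125 training and 125 test messages.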

Training Details

Training Epochs: 20
Batch Size: 8 (training), 16 (evaluation)
Optimizer and Learning Rate: AdamW with its default learning rate
Maximum Sequence Length: 128 tokens
Evaluation Strategy: Every 500 steps
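
To get a feel for what these settings imply, the number of optimizer steps can be worked out from the figures above. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. None of these are stated in the card, so the totals are estimates.

```python
import math

train_examples = 4500   # 90% of the 5,000 messages
train_batch_size = 8
epochs = 20
eval_every = 500        # steps between evaluations

# Assumptions: single device, no gradient accumulation, partial batch kept.
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_steps = steps_per_epoch * epochs
num_evaluations = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evaluations)  # 563 11260 22
```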

Usage

Using Transformers Pipeline

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)

# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")

Using the Model Directly

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax().item()
    confidence = predictions.max().item()

# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]

print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
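
The softmax and argmax step above can also be illustrated without downloading the model. The following pure-Python sketch decodes several rows of logits at once. The helper names `softmax` and `decode` and the logit values are made up for the example and are not real model output.

```python
import math

ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(batch_logits):
    """Return one (label, confidence) pair per row of logits."""
    results = []
    for row in batch_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((ID2LABEL[best], probs[best]))
    return results

# Made-up logits for two messages, for illustration only.
example_logits = [[4.0, 0.5, -1.0, 0.2], [0.1, 3.2, 0.3, -0.5]]
for label, confidence in decode(example_logits):
    print(f"{label}: {confidence:.4f}")
```

When `outputs.logits` holds several rows, `outputs.logits.argmax(dim=-1)` returns one class ID per row. Calling `argmax()` without `dim` indexes the flattened tensor, so it is only correct for a single message.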

Gradio Web Interface

import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def classify_text(text):
    result = classifier(text)[0]
    return result["label"], float(result["score"])

# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence")
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general."
)

iface.launch()

Model Performance

The model performs well on the test set and is particularly effective at:

Distinguishing between greetings and general statements
Identifying questions in both MSA and Iraqi dialect
Classifying complaints and technical issues
Handling mixed dialectal variations
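
Readers who want to check these claims on their own labeled messages can compute per-class precision and recall. The sketch below is one way to do so. The function name `per_class_report` and the gold and predicted labels are hypothetical and do not represent results from this model.

```python
from collections import Counter

LABELS = ["greeting", "question", "complaint", "general"]

def per_class_report(gold, predicted):
    """Return {label: (precision, recall)} for paired gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label was wrong
            fn[g] += 1   # gold label was missed
    report = {}
    for label in LABELS:
        pred_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / pred_total if pred_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        report[label] = (precision, recall)
    return report

# Hypothetical labels for illustration only; not results from this model.
gold = ["greeting", "question", "complaint", "general", "question", "general"]
predicted = ["greeting", "question", "general", "general", "question", "greeting"]

report = per_class_report(gold, predicted)
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
for label, (precision, recall) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```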

Supported Dialects

Modern Standard Arabic (MSA): Formal Arabic text
Iraqi Dialect: Colloquial Iraqi Arabic expressions and vocabulary

Limitations

The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
Limited to 4 predefined categories
Performance depends on the similarity of input text to training data patterns
Inputs are limited to 128 tokens, the maximum sequence length used during training; longer messages are truncated

Ethical Considerations

This model is intended for text classification purposes and should be used responsibly. Users should be aware that:

The model may reflect biases present in the training data
Performance may vary across different Arabic dialects not represented in training
The model should not be used for sensitive applications without proper validation

Citation

If you use this model in your research, please cite:

@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}

Model Card

For more detailed information about the model's intended use, training data, and ethical considerations, please refer to the model card.

Contact

For questions or issues, please contact [email protected] or create an issue in the model repository.

License

This model is released under the MIT License, the same license as the base model morit/arabic_xlm_xnli.
