Arabic Message Classification Model

Model Description

This is a fine-tuned XLM-RoBERTa model for Arabic message classification, specifically designed to classify messages in both Modern Standard Arabic (MSA) and Iraqi dialect. The model is based on morit/arabic_xlm_xnli and has been fine-tuned on a custom dataset of 5,000 Arabic messages.

Model Details

  • Base Model: morit/arabic_xlm_xnli
  • Architecture: XLMRobertaForSequenceClassification
  • Language: Arabic (MSA and Iraqi dialect)
  • Task: Text Classification
  • Number of Labels: 4
  • Model Size: ~278M parameters (FP32)

Labels

The model classifies messages into four categories:

  • 0 (greeting): greetings and salutations, e.g. "السلام عليكم" ("peace be upon you"), "هلو" (Iraqi "hello"), "مرحبا" ("hi")
  • 1 (question): questions and inquiries, e.g. "كيف حالك؟" ("how are you?"), "شلونك؟" (Iraqi "how are you?"), "متى الاجتماع؟" ("when is the meeting?")
  • 2 (complaint): complaints and problems, e.g. "عندي مشكلة" ("I have a problem"), "الانترنت معطل" ("the internet is down"), "الجهاز لا يعمل" ("the device is not working")
  • 3 (general): general statements, e.g. "أحب القراءة" ("I love reading"), "أعمل مهندساً" ("I work as an engineer"), "أسافر كثيراً" ("I travel a lot")
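
To confirm the mapping at runtime, you can inspect the model's configuration (a minimal sketch; it assumes the config's id2label matches the table above):

from transformers import AutoConfig

# Inspect the label mapping stored with the model
# (assumed to match the table above).
config = AutoConfig.from_pretrained("ahmedmajid92/Arabic_MI_Classifier")
print(config.id2label)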

Training Data

The model was trained on a custom dataset containing:

  • 5,000 Arabic messages (50% MSA, 50% Iraqi dialect)
  • Balanced distribution: 1,250 examples per class
  • Train/Test Split: 90%/10%
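
The dataset itself is not published. As an illustration only, a stratified 90/10 split like the one described above could be produced with the datasets library; the toy data and the "text"/"label" column names below are assumptions:

from datasets import Dataset

# Hypothetical reconstruction: the original 5,000-message dataset is not
# published, so toy data stands in for it here.
texts = ["السلام عليكم", "شلونك؟", "عندي مشكلة", "أحب القراءة"] * 10
labels = ["greeting", "question", "complaint", "general"] * 10

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.class_encode_column("label")  # required for stratified splitting
splits = ds.train_test_split(test_size=0.1, seed=42, stratify_by_column="label")
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))  # 36 4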

Training Details

  • Training Epochs: 20
  • Batch Size: 8 (training), 16 (evaluation)
  • Optimizer: AdamW with the default learning rate
  • Maximum Sequence Length: 128 tokens
  • Evaluation Strategy: Every 500 steps
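
The training script was not released with this model; the following is a minimal sketch of Hugging Face TrainingArguments consistent with the hyperparameters listed above (output_dir is a placeholder):

from transformers import TrainingArguments

# Illustrative settings matching the values above; the actual
# training configuration was not published.
training_args = TrainingArguments(
    output_dir="arabic-mi-classifier",  # placeholder
    num_train_epochs=20,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,        # AdamW default in the HF Trainer
    eval_strategy="steps",     # "evaluation_strategy" in older transformers
    eval_steps=500,
)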

Usage

Using Transformers Pipeline

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)

# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")

Using the Model Directly

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax().item()
    confidence = predictions.max().item()

# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]

print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")

Gradio Web Interface

import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def classify_text(text):
    result = classifier(text)[0]
    return result["label"], float(result["score"])

# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence")
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general."
)

iface.launch()

Model Performance

The model achieves a self-reported accuracy of 0.950 on the test set (see Evaluation Results below). It is particularly effective at:

  • Distinguishing between greetings and general statements
  • Identifying questions in both MSA and Iraqi dialect
  • Classifying complaints and technical issues
  • Handling mixed dialectal variations

Supported Dialects

  • Modern Standard Arabic (MSA): Formal Arabic text
  • Iraqi Dialect: Colloquial Iraqi Arabic expressions and vocabulary

Limitations

  • The model is trained specifically on MSA and the Iraqi dialect; performance may degrade on other Arabic dialects
  • Output is limited to the 4 predefined categories
  • Performance depends on how closely the input resembles the training data
  • Maximum input length is 128 tokens; longer inputs are truncated (see the sketch below)
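
Since inputs longer than 128 tokens are truncated, a small utility can flag over-length messages before classification. This is a sketch using the model's own tokenizer; the helper name is ours:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ahmedmajid92/Arabic_MI_Classifier")

def fits_in_context(text: str, max_length: int = 128) -> bool:
    # Token count includes special tokens, matching what the model sees
    return len(tokenizer(text)["input_ids"]) <= max_length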

Ethical Considerations

This model is intended for text classification purposes and should be used responsibly. Users should be aware that:

  • The model may reflect biases present in the training data
  • Performance may vary across different Arabic dialects not represented in training
  • The model should not be used for sensitive applications without proper validation

Citation

If you use this model in your research, please cite:

@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}

Model Card

This document is the model card: the sections above describe the model's intended use, training data, performance, and ethical considerations.

Contact

For questions or issues, please contact [email protected] or create an issue in the model repository.

License

This model is released under the MIT License, same as the base model morit/arabic_xlm_xnli.

Evaluation Results

  • Accuracy on Arabic Messages Dataset (MSA + Iraqi): 0.950 (self-reported)