โœ๐Ÿป MARBERTv2 Arabic Written Dialect Classifier

Model Overview

This model is a fine-tuned version of UBC-NLP/MARBERTv2 for Arabic written dialect classification. It identifies Modern Standard Arabic (MSA) and four regional Arabic dialect groups from raw text.

This model is intended for use in tasks such as dialect identification, linguistic research, and dialect-aware natural language processing systems.


📌 Model Details

This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes among five major varieties of written Arabic:

  • MAGHREB (North African dialects)
  • LEV (Levantine dialects)
  • MSA (Modern Standard Arabic)
  • GLF (Gulf dialects)
  • EGY (Egyptian Arabic)

It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.


📊 Labels (id2label)

The model predicts one of the following five classes:

```json
{
  "0": "MAGHREB",  // Maghrebi dialects (Northwest Africa: Morocco, Algeria, Tunisia, etc.)
  "1": "LEV",      // Levantine dialects (Lebanon, Syria, Jordan, Palestine)
  "2": "MSA",      // Modern Standard Arabic
  "3": "GLF",      // Gulf dialects (Saudi Arabia, UAE, Kuwait, etc.)
  "4": "EGY"       // Egyptian Arabic
}
```
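
The same mapping ships inside the model's config, so downstream code can read the labels instead of hard-coding them:

```python
from transformers import AutoConfig

# Load only the config (no weights needed) to inspect the label mapping
config = AutoConfig.from_pretrained("IbrahimAmin/marbertv2-arabic-written-dialect-classifier")

print(config.id2label)  # {0: 'MAGHREB', 1: 'LEV', 2: 'MSA', 3: 'GLF', 4: 'EGY'}
print(config.label2id)  # inverse mapping, handy when preparing labeled data
```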

📚 Training Data

The model was trained on more than 825,000 Arabic sentences drawn from nine publicly available datasets, covering a wide variety of written Arabic dialects.

Distribution by Dialect:

| Dialect | Sentences |
|---------|----------:|
| GLF     | 253,553 |
| LEV     | 243,025 |
| MAGHREB | 140,887 |
| EGY     | 105,226 |
| MSA     | 83,231 |
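
Note the class imbalance: GLF has roughly three times as many sentences as MSA. If you continue fine-tuning on similarly skewed data, a common mitigation is inverse-frequency class weighting in the loss. The sketch below is a generic recipe, not something documented for this model's training run:

```python
import torch

# Sentence counts from the table above, ordered by label id
# (0: MAGHREB, 1: LEV, 2: MSA, 3: GLF, 4: EGY)
counts = torch.tensor([140_887, 243_025, 83_231, 253_553, 105_226], dtype=torch.float)

# Inverse-frequency weights, normalized so the average weight is 1.0
weights = counts.sum() / (len(counts) * counts)

# Under-represented classes (MSA, EGY) now contribute more to the loss
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```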

โš™๏ธ Training Details

  • Architecture: MARBERTv2 (BERT-based)
  • Task: Text Classification (Dialect Identification)
  • Objective: Multi-class classification with softmax over 5 dialect classes
  • Tokenizer: UBC-NLP/MARBERTv2
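
For orientation, the sketch below shows how such a fine-tune is typically set up with the Hugging Face `Trainer` API. The hyperparameters and the `train_ds`/`eval_ds` datasets are illustrative assumptions, not the exact recipe behind this checkpoint:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

base = "UBC-NLP/MARBERTv2"
id2label = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=5,
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

args = TrainingArguments(
    output_dir="marbertv2-dialect",
    per_device_train_batch_size=32,  # illustrative values, not the documented settings
    learning_rate=2e-5,
    num_train_epochs=3,
)

# train_ds / eval_ds are assumed to be tokenized datasets with
# "input_ids", "attention_mask", and integer "labels" columns.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```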

📂 Datasets Used

Below is a detailed overview of the datasets used in training and/or considered during development:

| Dataset | Brief Description | Annotation Strategy | Provided Labels | Current SOTA Performance |
|---------|-------------------|---------------------|-----------------|--------------------------|
| MADAR Subtask-1 (MADAR-6) | Parallel sentences (BTEC) covering the dialects of 5 Arab cities plus MSA in the travel domain (10,000 sentences per city) | Manual | 5 Arab cities + MSA | 92.5% accuracy |
| MADAR Subtask-1 (MADAR-26) | Parallel sentences (BTEC) covering the dialects of 25 Arab cities plus MSA in the travel domain (2,000 sentences per city) | Manual | 25 Arab cities + MSA | 67.32% F1-score |
| DART | 25K tweets annotated via crowdsourcing, well balanced over five main groups of Arabic dialects | Manual | 5 Arab regions | UNK |
| ArSarcasm v1 | 10,547 tweets from the ASTD and SemEval datasets for sarcasm detection, with dialect information added | Manual | 4 Arab regions + MSA | UNK |
| ArSarcasm v2 | 15,548 tweets extending ArSarcasm v1 with portions of the DAICT corpus and some new tweets | Manual | 4 Arab regions + MSA | UNK |
| IADD | Built from five publicly available corpora (AOC, DART, PADIC, SHAMI, and TSAC) that were identified, analyzed, and filtered | — | 5 regions and 9 countries | UNK |
| QADI | 540K tweets (30K per country on average) with a total of 8.8M words | Automatic | 18 Arab countries | 60.6% |
| AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: Al-Ghad (Jordan), Al-Riyadh (Saudi Arabia), and Al-Youm Al-Sabe' (Egypt) | Manual | 3 Arab regions + MSA | UNK |
| NADI-2020 | 25,957 tweets from 100 Arab provinces and 21 Arab countries | Automatic | 100 provinces, 21 countries | 6.39% - 26.78% |

💡 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Egyptian Arabic: "Life isn't worth rushing through like that; take your time and enjoy the simple things."
text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")

# Run inference
with torch.inference_mode():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()

print(f"Predicted Dialect: {model.config.id2label[pred]}")
```

✨ Acknowledgements

  • MARBERTv2 team at UBC-NLP
  • Contributors of the Arabic dialect datasets used in training

๐Ÿ“ Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
  author       = {Ibrahim Amin},
  title        = {MARBERTv2 Arabic Written Dialect Classifier},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
}
```