✍🏻 MARBERTv2 Arabic Written Dialect Classifier

Model Overview

This model is a fine-tuned version of UBC-NLP/MARBERTv2 for Arabic written dialect classification. It identifies Modern Standard Arabic (MSA) and four regional Arabic dialect groups from raw text.

This model is intended for use in tasks such as dialect identification, linguistic research, and dialect-aware natural language processing systems.

📌 Model Details

This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes five classes of written Arabic, namely four regional dialect groups and MSA:

MAGHREB (North African dialects)
LEV (Levantine dialects)
MSA (Modern Standard Arabic)
GLF (Gulf dialects)
EGY (Egyptian Arabic)

It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.

📊 Labels (`id2label`)

The model predicts one of the following five classes:

{
  "0": "MAGHREB", // Maghreb dialect (Northwest Africa: Morocco, Algeria, Tunisia, etc.)
  "1": "LEV",     // Levantine dialect (Lebanon, Syria, Jordan, Palestine)
  "2": "MSA",     // Modern Standard Arabic
  "3": "GLF",     // Gulf dialect (Saudi Arabia, UAE, Kuwait, etc.)
  "4": "EGY"      // Egyptian dialect
}
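
The mapping above is stored with string keys in the JSON config, while `transformers` exposes `model.config.id2label` with integer keys. The following is a minimal sketch of building both lookup directions without loading the model, assuming the five labels listed above. The variable names are illustrative and not part of the model's API.

```python
import json

# id2label as listed above, with the comments removed so it parses as JSON.
ID2LABEL_JSON = '{"0": "MAGHREB", "1": "LEV", "2": "MSA", "3": "GLF", "4": "EGY"}'

# JSON object keys are always strings; convert them to ints to index by class id.
id2label = {int(key): label for key, label in json.loads(ID2LABEL_JSON).items()}

# Inverse map, useful when encoding gold labels for evaluation.
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[4])      # EGY
print(label2id["MSA"])  # 2
```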

📚 Training Data

The model was trained on roughly 850,000 Arabic sentences drawn from nine publicly available datasets covering a wide variety of written Arabic dialects.

Distribution by Dialect:

Dialect	Count
GLF	253,553
LEV	243,025
MAGHREB	140,887
EGY	105,226
MSA	83,231
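
The classes are noticeably imbalanced: GLF has roughly three times as many sentences as MSA. The sketch below computes each class's share of the listed counts and, purely as an illustration, the inverse-frequency weights that are sometimes used to compensate for such imbalance. This card does not say whether any class weighting was applied during training, so the weighting step is an assumption rather than a description of the author's setup.

```python
# Sentence counts per dialect, copied from the table above.
counts = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}

total = sum(counts.values())

# Share of the listed sentences taken by each class.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" inverse-frequency weights: total / (num_classes * count).
# Assumption: illustration only, not a documented part of this model's training.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:8s} share={shares[label]:.1%} weight={weights[label]:.2f}")
```

Under this scheme the rarest class (MSA) receives the largest weight and the most frequent class (GLF) the smallest.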

⚙️ Training Details

Architecture: MARBERTv2 (BERT-based)
Task: Text Classification (Dialect Identification)
Objective: Multi-class classification with softmax over 5 dialect classes
Tokenizer: UBC-NLP/MARBERTv2

📂 Datasets Used

Below is a detailed overview of the datasets used in training and/or considered during development:

Dataset	Brief Description	Annotation strategy	Provided Labels	Current SOTA Performance
MADAR Subtask-1 (MADAR-6)	A collection of `parallel sentences (BTEC)` covering the dialects of `5 cities from the Arab World and MSA` in the travel domain `(10,000 sentences per city)`	Manual	5 Arab Cities + MSA	92.5% Accuracy
MADAR Subtask-1 (MADAR-26)	A collection of `parallel sentences (BTEC)` covering the dialects of `25 cities from the Arab World and MSA` in the travel domain `(2,000 sentences per city)`	Manual	25 Arab Cities + MSA	67.32% F1-Score
DART	`25K tweets` annotated via crowdsourcing and well balanced across five main groups of Arabic dialects	Manual	5 Arab Regions	UNK
ArSarcasm v1	`10,547 tweets` from the `ASTD and SemEval datasets` for sarcasm detection, with dialect information added	Manual	4 Arab Regions + MSA	UNK
ArSarcasm v2	`15,548 tweets` extending the original ArSarcasm dataset `(consists of ArSarcasm v1 along with portions of the DAICT corpus and some new tweets)`	Manual	4 Arab Regions + MSA	UNK
IADD	`Five publicly available corpora` were identified, analyzed and filtered to build IADD `(AOC, DART, PADIC, SHAMI and TSAC)`	________	5 Regions and 9 Countries	UNK
QADI	`540k tweets` (30k per country on average) with a total of 8.8M words	Automatic	18 Arab Countries	60.6%
AOC	The Arabic Online Commentary dataset is based on reader commentary from the online versions of three Arabic newspapers: `AlGhad from JOR, Al-Riyadh from KSA, and Al-Youm Al-Sabe’ from EGY`	Manual	3 Arab Regions + MSA	UNK
NADI-2020	`25,957 tweets` from 100 Arab provinces and 21 Arab countries	Automatic	100 Provinces and 21 Countries	6.39% - 26.78%
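
The datasets above are labeled at different granularities (cities, provinces, countries, or regions), so training a five-class model requires collapsing them onto a common label set. This card does not describe how that was done. The sketch below shows one hypothetical way, using only the countries named in the label descriptions earlier in this card. The function name `to_region`, the two-letter country codes, and the mapping itself are assumptions for illustration, not the author's preprocessing.

```python
from typing import Optional

REGION_LABELS = {"MAGHREB", "LEV", "MSA", "GLF", "EGY"}

# Hypothetical country-to-region mapping, limited to the countries that the
# label descriptions mention. Not the author's actual mapping.
COUNTRY_TO_REGION = {
    "MA": "MAGHREB", "DZ": "MAGHREB", "TN": "MAGHREB",   # Morocco, Algeria, Tunisia
    "LB": "LEV", "SY": "LEV", "JO": "LEV", "PS": "LEV",  # Lebanon, Syria, Jordan, Palestine
    "SA": "GLF", "AE": "GLF", "KW": "GLF",               # Saudi Arabia, UAE, Kuwait
    "EG": "EGY",                                         # Egypt
}

def to_region(label: str) -> Optional[str]:
    """Map a country code or an existing region label to one of the five classes.

    Returns None for labels this sketch does not cover, so the caller can
    decide whether to drop those rows or review them manually.
    """
    label = label.strip().upper()
    if label in REGION_LABELS:
        return label
    return COUNTRY_TO_REGION.get(label)

print(to_region("jo"))   # LEV
print(to_region("MSA"))  # MSA
print(to_region("SD"))   # None
```

Returning `None` for uncovered labels keeps the decision about ambiguous countries explicit instead of silently assigning them to a region.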

💡 Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")

# Run inference
with torch.inference_mode():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()

print(f"Predicted Dialect: {model.config.id2label[pred]}")
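
The snippet above reports only the top label. To see how confident the prediction is, the logits can be passed through a softmax and ranked. The following is a minimal, dependency-free sketch; `rank_dialects` is a hypothetical helper, and the example logits are made up rather than real model output.

```python
import math

ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

def rank_dialects(logits, id2label=ID2LABEL):
    """Return (label, probability) pairs sorted from most to least likely."""
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(probs, range(len(probs))), reverse=True)
    return [(id2label[i], p) for p, i in ranked]

# Made-up logits for illustration only.
example_logits = [-1.2, 0.3, -0.5, 0.1, 4.0]
for label, prob in rank_dialects(example_logits):
    print(f"{label:8s} {prob:.3f}")
```

With the real model, `logits[0].tolist()` gives the list to pass in, and `torch.softmax(logits, dim=-1)` computes the same probabilities directly on the tensor.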

✨ Acknowledgements

MARBERTv2 team at UBC-NLP
Contributors of the Arabic dialect datasets used in training

📝 Citation

If you use this model in your research or application, please cite:

@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
  author = {Ibrahim Amin},
  title = {MARBERTv2 Arabic Written Dialect Classifier},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
}
