---
license: apache-2.0
language:
- ar
- arz
- ary
base_model:
- UBC-NLP/MARBERTv2
pipeline_tag: text-classification
library_name: transformers
datasets:
- iabufarha/ar_sarcasm
- Abdelrahman-Rezk/Arabic_Dialect_Identification
- asas-ai/DART
- arbml/ArSarcasm_v2
- evageon/IADD
---

# ✍🏻 MARBERTv2 Arabic Written Dialect Classifier

## Model Overview

This model is a fine-tuned version of [`UBC-NLP/MARBERTv2`](https://huggingface.co/UBC-NLP/MARBERTv2) for **Arabic written dialect classification**. It classifies raw text as Modern Standard Arabic (MSA) or one of four regional Arabic dialect groups.

This model is intended for use in tasks such as dialect identification, linguistic research, and dialect-aware natural language processing systems.

---

## 📌 Model Details

This model is fine-tuned from **MARBERTv2**, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes among five written Arabic varieties, namely MSA and four regional dialect groups:

- **MAGHREB** (North African dialects)
- **LEV** (Levantine dialects)
- **MSA** (Modern Standard Arabic)
- **GLF** (Gulf dialects)
- **EGY** (Egyptian Arabic)

It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.

---

## 📊 Labels (`id2label`)

The model predicts one of the following five classes:

```json
{
  "0": "MAGHREB", // Maghreb dialect (Northwest Africa: Morocco, Algeria, Tunisia, etc.)
  "1": "LEV",     // Levantine dialect (Lebanon, Syria, Jordan, Palestine)
  "2": "MSA",     // Modern Standard Arabic
  "3": "GLF",     // Gulf dialect (Saudi Arabia, UAE, Kuwait, etc.)
  "4": "EGY"      // Egyptian dialect
}
```

---

## 📚 Training Data

The model was trained on roughly **850,000** Arabic sentences drawn from **9 publicly available datasets**, covering a wide variety of written Arabic dialects.

### Distribution by Dialect:

| Dialect   | Count    |
|-----------|----------|
| GLF       | 253,553  |
| LEV       | 243,025  |
| MAGHREB   | 140,887  |
| EGY       | 105,226  |
| MSA       | 83,231   |

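The counts in the table imply a noticeable class imbalance. The short sketch below simply redoes that arithmetic, computing each dialect's share and the ratio between the largest and smallest class. The variable names are illustrative, and nothing here is meant to describe how the imbalance was handled during training.

```python
# Sentence counts copied from the distribution table above.
counts = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print each dialect's share of the listed sentences as a percentage.
for label, share in shares.items():
    print(f"{label:<8} {share:6.1%}")

# Ratio between the largest and smallest class, a rough imbalance indicator.
imbalance = max(counts.values()) / min(counts.values())
print(f"total={total:,}  largest/smallest={imbalance:.2f}")
# → total=825,922  largest/smallest=3.05
```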
---

## ⚙️ Training Details

- **Architecture:** MARBERTv2 (BERT-based)
- **Task:** Text Classification (Dialect Identification)
- **Objective:** Multi-class classification with softmax over 5 dialect classes
- **Tokenizer:** `UBC-NLP/MARBERTv2`

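To make the "softmax over 5 dialect classes" objective concrete, here is a minimal pure-Python sketch of how five raw scores (logits) become a probability distribution over the labels. The logits are made-up illustrative values, not real model output, and `softmax` and `ID2LABEL` are names chosen for this example only.

```python
import math
from typing import Dict, List

# Label order as listed in this card's `id2label` mapping.
ID2LABEL: Dict[int, str] = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence (illustrative values only).
example_logits = [-1.2, 0.3, -0.5, 0.1, 4.0]

probs = softmax(example_logits)

# The predicted class is the index with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(ID2LABEL[best], f"{probs[best]:.3f}")
```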
---

## 📂 Datasets Used

Below is a detailed overview of the datasets used in training and/or considered during development:

|        **Dataset**         |                                                                                    **Brief Description**                                                                                    | **Annotation strategy** |    **Provided Labels**    | **Current SOTA Performance** |
| :------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :---------------------: | :-----------------------: | :--------------------------: |
| MADAR Subtask-1 (MADAR-6)  |               A Collection of `parallel sentences (BTEC)` covering the dialects of `5 cities from the Arab World and MSA` in the travel domain `(10,000 sentences per city)`                |         Manual          |    5 Arab Cities + MSA    |        92.5% Accuracy        |
| MADAR Subtask-1 (MADAR-26) |               A Collection of `parallel sentences (BTEC)` covering the dialects of `25 cities from the Arab World and MSA` in the travel domain `(2,000 sentences per city)`                |         Manual          |   25 Arab Cities + MSA    |       67.32% F1-Score        |
|            DART            |                                          `25K tweets` annotated via crowdsourcing and well balanced across five main groups of Arabic dialects                                           |         Manual          |      5 Arab Regions       |             UNK              |
|        ArSarcasm v1        |                                        `10,547 tweets` from the `ASTD and SemEval datasets` for sarcasm detection, with dialect labels added                                         |         Manual          |   4 Arab Regions + MSA    |             UNK              |
|        ArSarcasm v2        |          `15,548 tweets` extending the original ArSarcasm dataset `(ArSarcasm v1 plus portions of the DAICT corpus and some new tweets)`           |         Manual          |   4 Arab Regions + MSA    |             UNK              |
|            IADD            |                                 `Five publicly available corpora` were identified, analyzed and filtered to build IADD `(AOC, DART, PADIC, SHAMI and TSAC)`                                 |        ________         | 5 Regions and 9 Countries |             UNK              |
|            QADI            |                                                            `540k tweets` (30k per country on average) with a total of 8.8M words                                                            |        Automatic        |     18 Arab Countries     |            60.6%             |
|            AOC             | The Arabic Online Commentary dataset is based on reader commentary from the online versions of three Arabic newspapers: `AlGhad from JOR, Al-Riyadh from KSA, and Al-Youm Al-Sabe’ from EGY` |         Manual          |   3 Arab Regions + MSA    |             UNK              |
|         NADI-2020          |                                                                `25,957 Tweets` from 100 Arab provinces and 21 Arab countries                                                                |        Automatic        |  100 Prov. and 21 Coun.   |        6.39% - 26.78%        |

---

## 💡 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")

# Run inference
with torch.inference_mode():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()

print(f"Predicted Dialect: {model.config.id2label[pred]}")
```
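
When classifying several sentences at once, a batched variant that also returns per-class probabilities may be more convenient. The following is a sketch under stated assumptions, not an official API of this model: `rank_dialects` and `classify_batch` are hypothetical helper names, and `max_length=128` is an arbitrary cap chosen for short texts.

```python
from typing import Dict, List, Sequence, Tuple


def rank_dialects(
    probs: Sequence[float], id2label: Dict[int, str]
) -> List[Tuple[str, float]]:
    """Pair each probability with its label and sort from most to least likely."""
    pairs = [(id2label[i], float(p)) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def classify_batch(
    texts: List[str],
    model_name: str = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier",
    max_length: int = 128,  # assumption: an arbitrary cap for short snippets
) -> List[List[Tuple[str, float]]]:
    """Return ranked (label, probability) pairs for every input text."""
    # Imported lazily so `rank_dialects` stays usable without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Pad to the longest text in the batch and truncate overly long inputs.
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )

    with torch.inference_mode():
        probs = torch.softmax(model(**inputs).logits, dim=-1)

    return [rank_dialects(row.tolist(), model.config.id2label) for row in probs]
```

Calling `classify_batch` with a list of strings returns one ranked list per input, with the most likely dialect first. The model is loaded on every call in this sketch, so for repeated use it would be reasonable to load it once and reuse it.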

---

## ✨ Acknowledgements

- MARBERTv2 team at UBC-NLP
- Contributors of the Arabic dialect datasets used in training

---

## 📝 Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
  author = {Ibrahim Amin},
  title = {MARBERTv2 Arabic Written Dialect Classifier},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
}
```