MARBERTv2 Arabic Written Dialect Classifier
Model Overview
This model is a fine-tuned version of UBC-NLP/MARBERTv2 for Arabic written dialect classification. It identifies Modern Standard Arabic (MSA) and four regional Arabic dialects from raw text.
This model is intended for use in tasks such as dialect identification, linguistic research, and dialect-aware natural language processing systems.
Model Details
This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes among five classes, four regional dialect groups plus Modern Standard Arabic:
- MAGHREB (North African dialects)
- LEV (Levantine dialects)
- MSA (Modern Standard Arabic)
- GLF (Gulf dialects)
- EGY (Egyptian Arabic)
It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.
Labels (id2label)
The model predicts one of the following five classes:
{
  "0": "MAGHREB", // Maghreb dialect (Northwest Africa: Morocco, Algeria, Tunisia, etc.)
  "1": "LEV", // Levantine dialect (Lebanon, Syria, Jordan, Palestine)
  "2": "MSA", // Modern Standard Arabic
  "3": "GLF", // Gulf dialect (Saudi Arabia, UAE, Kuwait, etc.)
  "4": "EGY" // Egyptian dialect
}
Training Data
The model was trained on roughly 850,000 Arabic sentences drawn from 9 publicly available datasets, covering a wide variety of written Arabic dialects.
Distribution by Dialect:
Dialect | Count |
---|---|
GLF | 253,553 |
LEV | 243,025 |
MAGHREB | 140,887 |
EGY | 105,226 |
MSA | 83,231 |
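The counts listed above sum to 825,922 sentences and are imbalanced: GLF has roughly three times as many sentences as MSA. As a minimal sketch, the snippet below computes each dialect's share of that total, along with inverse-frequency class weights of the kind sometimes passed to a weighted cross-entropy loss. The weighting scheme is an assumption for illustration only; this card does not state that class weights were used in training.

```python
# Sentence counts copied from the table above.
counts = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}

total = sum(counts.values())

# Share of the listed total held by each dialect.
shares = {label: n / total for label, n in counts.items()}

# Hypothetical "balanced" weights: total / (num_classes * class_count).
# Illustrative only; not a documented training setting for this model.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:8s} share={shares[label]:.1%}  weight={weights[label]:.2f}")
```

Under this scheme the rarest class (MSA) receives the largest weight and the most frequent class (GLF) the smallest.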
Training Details
- Architecture: MARBERTv2 (BERT-based)
- Task: Text Classification (Dialect Identification)
- Objective: Multi-class classification with softmax over 5 dialect classes
- Tokenizer: UBC-NLP/MARBERTv2
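To make the objective concrete: the classification head emits five logits per input, and softmax turns them into one probability per dialect. The following is a minimal sketch in plain Python using the `id2label` mapping from the Labels section. The helper names `softmax` and `rank_dialects` are illustrative, and the logit values are made up rather than real model output.

```python
import math
from typing import Dict, List, Tuple

ID2LABEL: Dict[int, str] = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    denom = sum(exps)
    return [e / denom for e in exps]


def rank_dialects(logits: List[float]) -> List[Tuple[str, float]]:
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    pairs = [(ID2LABEL[i], p) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up logits for a single sentence (not real model output).
example_logits = [0.2, 1.1, -0.5, 0.3, 3.4]
ranking = rank_dialects(example_logits)
for label, prob in ranking:
    print(f"{label:8s} {prob:.3f}")
```

Inspecting the full ranking, rather than only the argmax, can help flag inputs where two dialects receive similar probability.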
Datasets Used
Below is a detailed overview of the datasets used in training and/or considered during development:
Dataset | Brief Description | Annotation Strategy | Provided Labels | Current SOTA Performance |
---|---|---|---|---|
MADAR Subtask-1 (MADAR-6) | A collection of parallel sentences (BTEC) covering the dialects of 5 cities from the Arab World and MSA in the travel domain (10,000 sentences per city) | Manual | 5 Arab Cities + MSA | 92.5% Accuracy |
MADAR Subtask-1 (MADAR-26) | A collection of parallel sentences (BTEC) covering the dialects of 25 cities from the Arab World and MSA in the travel domain (2,000 sentences per city) | Manual | 25 Arab Cities + MSA | 67.32% F1-Score |
DART | 25K tweets annotated via crowdsourcing, well balanced across five main groups of Arabic dialects | Manual | 5 Arab Regions | UNK |
ArSarcasm v1 | 10,547 tweets from the ASTD and SemEval datasets for sarcasm detection, with dialect information added | Manual | 4 Arab Regions + MSA | UNK |
ArSarcasm v2 | 15,548 tweets; an extension of the original ArSarcasm dataset (ArSarcasm v1 plus portions of the DAICT corpus and some new tweets) | Manual | 4 Arab Regions + MSA | UNK |
IADD | Built from five publicly available corpora that were identified, analyzed, and filtered (AOC, DART, PADIC, SHAMI, and TSAC) | ________ | 5 Regions and 9 Countries | UNK |
QADI | 540k tweets (30k per country on average) with a total of 8.8M words | Automatic | 18 Arab Countries | 60.6% |
AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: AlGhad from JOR, Al-Riyadh from KSA, and Al-Youm Al-Sabe' from EGY | Manual | 3 Arab Regions + MSA | UNK |
NADI-2020 | 25,957 tweets from 100 Arab provinces and 21 Arab countries | Automatic | 100 Provinces and 21 Countries | 6.39% - 26.78% |
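The datasets above label text at different granularities (cities, provinces, countries, regions), so their labels have to be collapsed onto the model's five classes before training. This card does not describe how that was done. The following is a hypothetical sketch that uses only the country groupings given in the label descriptions earlier in this card. `COUNTRY_TO_CLASS` and `to_model_label` are illustrative names, and countries the card does not mention are deliberately left unmapped.

```python
from typing import Optional

# Groupings taken from the label descriptions in this card.
# Countries the card does not name are intentionally absent.
COUNTRY_TO_CLASS = {
    "Morocco": "MAGHREB",
    "Algeria": "MAGHREB",
    "Tunisia": "MAGHREB",
    "Lebanon": "LEV",
    "Syria": "LEV",
    "Jordan": "LEV",
    "Palestine": "LEV",
    "Saudi Arabia": "GLF",
    "UAE": "GLF",
    "Kuwait": "GLF",
    "Egypt": "EGY",
}

MODEL_CLASSES = {"MAGHREB", "LEV", "MSA", "GLF", "EGY"}


def to_model_label(source_label: str) -> Optional[str]:
    """Map a dataset-specific label to one of the five classes, or None."""
    label = source_label.strip()
    # Region-level labels that already match a model class pass through.
    if label.upper() in MODEL_CLASSES:
        return label.upper()
    # Country-level labels are looked up; unknown ones yield None.
    return COUNTRY_TO_CLASS.get(label)


for raw in ["Syria", "msa", "Iraq"]:
    print(raw, "->", to_model_label(raw))
```

Returning `None` for unmapped labels keeps them visible, so they can be reviewed or dropped rather than silently assigned to a region.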
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")
# Run inference without tracking gradients
with torch.inference_mode():
    logits = model(**inputs).logits
# Index of the highest-scoring class
pred = torch.argmax(logits, dim=-1).item()
print(f"Predicted Dialect: {model.config.id2label[pred]}")
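For more than one sentence at a time, a batched variant of the snippet above might look like the sketch below. It assumes the same checkpoint and the standard `transformers` tokenizer arguments `padding`, `truncation`, and `max_length`. The helper names `chunked` and `classify_batch`, the batch size of 32, and the 128-token limit are illustrative choices, not settings documented for this model.

```python
from typing import Dict, Iterator, List

MODEL_NAME = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_batch(texts: List[str], batch_size: int = 32,
                   max_length: int = 128) -> List[Dict[str, object]]:
    """Return the top label and its softmax probability for each text."""
    # Imported lazily so the pure-Python helper above has no heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()

    results: List[Dict[str, object]] = []
    for batch in chunked(texts, batch_size):
        # Pad to the longest sentence in the batch; truncate overly long inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        with torch.inference_mode():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        scores, ids = probs.max(dim=-1)
        for text, score, idx in zip(batch, scores.tolist(), ids.tolist()):
            results.append({"text": text,
                            "label": model.config.id2label[idx],
                            "score": score})
    return results
```

Calling `classify_batch(list_of_sentences)` downloads the checkpoint on first use and returns one dictionary per input, in the original order.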
Acknowledgements
- MARBERTv2 team at UBC-NLP
- Contributors of the Arabic dialect datasets used in training
Citation
If you use this model in your research or application, please cite:
@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
author = {Ibrahim Amin},
title = {MARBERTv2 Arabic Written Dialect Classifier},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
}