---
license: apache-2.0
language:
- ar
- arz
- ary
base_model:
- UBC-NLP/MARBERTv2
pipeline_tag: text-classification
library_name: transformers
datasets:
- iabufarha/ar_sarcasm
- Abdelrahman-Rezk/Arabic_Dialect_Identification
- asas-ai/DART
- arbml/ArSarcasm_v2
- evageon/IADD
---

# ✍🏻 MARBERTv2 Arabic Written Dialect Classifier

## Model Overview

This model is a fine-tuned version of [`UBC-NLP/MARBERTv2`](https://huggingface.co/UBC-NLP/MARBERTv2) for **Arabic written dialect classification**. It identifies Modern Standard Arabic (MSA) and 4 regional Arabic dialects from raw text.

It is intended for tasks such as dialect identification, linguistic research, and dialect-aware natural language processing systems.

---

## 📌 Model Details

This model is fine-tuned from **MARBERTv2**, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes among five varieties of written Arabic (MSA and four regional dialect groups):

- **MAGHREB** (North African dialects: Morocco, Algeria, Tunisia, etc.)
- **LEV** (Levantine dialects: Lebanon, Syria, Jordan, Palestine)
- **MSA** (Modern Standard Arabic)
- **GLF** (Gulf dialects: Saudi Arabia, UAE, Kuwait, etc.)
- **EGY** (Egyptian Arabic)

It targets short Arabic text snippets from sources such as social media, forums, and informal writing.

---

## 📊 Labels (`id2label`)

The model predicts one of the following five classes:

```json
{
  "0": "MAGHREB",
  "1": "LEV",
  "2": "MSA",
  "3": "GLF",
  "4": "EGY"
}
```

---

## 📚 Training Data

The model was trained on about **850,000** Arabic sentences from **9 publicly available datasets**, covering a wide variety of written Arabic dialects.

### Distribution by Dialect:

| Dialect   | Count    |
|-----------|----------|
| GLF       | 253,553  |
| LEV       | 243,025  |
| MAGHREB   | 140,887  |
| EGY       | 105,226  |
| MSA       | 83,231   |

---

## ⚙️ Training Details

- **Architecture:** MARBERTv2 (BERT-based)
- **Task:** Text Classification (Dialect Identification)
- **Objective:** Multi-class classification with softmax over 5 dialect classes
- **Tokenizer:** `UBC-NLP/MARBERTv2`

---

## 📂 Datasets Used

Below is a detailed overview of the datasets used in training and/or considered during development:

| **Dataset** | **Brief Description** | **Annotation strategy** | **Provided Labels** | **Current SOTA Performance** |
| :---: | :---: | :---: | :---: | :---: |
| MADAR Subtask-1 (MADAR-6) | A collection of `parallel sentences (BTEC)` covering the dialects of `5 cities from the Arab World and MSA` in the travel domain `(10,000 sentences per city)` | Manual | 5 Arab Cities + MSA | 92.5% Accuracy |
| MADAR Subtask-1 (MADAR-26) | A collection of `parallel sentences (BTEC)` covering the dialects of `25 cities from the Arab World and MSA` in the travel domain `(2,000 sentences per city)` | Manual | 25 Arab Cities + MSA | 67.32% F1-Score |
| DART | `25K tweets` annotated via crowdsourcing and well balanced across five main groups of Arabic dialects | Manual | 5 Arab Regions | UNK |
| ArSarcasm v1 | `10,547 tweets` from `ASTD and SemEval datasets` for sarcasm detection, with dialect information added | Manual | 4 Arab Regions + MSA | UNK |
| ArSarcasm v2 | `15,548 tweets`; an extension of the original ArSarcasm dataset `(consists of ArSarcasm v1 along with portions of the DAICT corpus and some new tweets)` | Manual | 4 Arab Regions + MSA | UNK |
| IADD | `Five publicly available corpora` were identified, analyzed and filtered to build IADD `(AOC, DART, PADIC, SHAMI and TSAC)` | ________ | 5 Regions and 9 Countries | UNK |
| QADI | `540k tweets` (30k per country on average) with a total of 8.8M words | Automatic | 18 Arab Countries | 60.6% |
| AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: `AlGhad from JOR, Al-Riyadh from KSA, and Al-Youm Al-Sabe’ from EGY` | Manual | 3 Arab Regions + MSA | UNK |
| NADI-2020 | `25,957 tweets` from 100 Arab provinces and 21 Arab countries | Automatic | 100 Prov. and 21 Coun. | 6.39% - 26.78% |

---

## 💡 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example input sentence
text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.inference_mode():
    logits = model(**inputs).logits

# Pick the highest-scoring class and map its index to a dialect label
pred = torch.argmax(logits, dim=-1).item()
print(f"Predicted Dialect: {model.config.id2label[pred]}")
```

---

## ✨ Acknowledgements

- MARBERTv2 team at UBC-NLP
- Contributors of the Arabic dialect datasets used in training

---

## 📝 Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
  author = {Ibrahim Amin},
  title = {MARBERTv2 Arabic Written Dialect Classifier},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
}
```
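
---

## 🔢 From Logits to Per-Dialect Probabilities

The Usage section reports only the top label. Since the training objective is a softmax over the 5 dialect classes, the raw logits can also be turned into a probability per dialect. The following is a minimal sketch in plain Python, not part of the released model: the helper names `dialect_probabilities` and `top_dialect` and the example logits are hypothetical, and the label mapping is copied from the `id2label` section.

```python
import math

# Label mapping copied from the model card's id2label section.
ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}


def dialect_probabilities(logits):
    """Apply a numerically stable softmax and return {label: probability}.

    Hypothetical helper: `logits` is a plain list of five floats, one per class.
    """
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {ID2LABEL[i]: e / total for i, e in enumerate(exps)}


def top_dialect(logits):
    """Return the most probable label together with its probability."""
    probs = dialect_probabilities(logits)
    label = max(probs, key=probs.get)
    return label, probs[label]


# Made-up logits for illustration only; real values come from the model.
example_logits = [0.1, 0.3, -1.2, 0.0, 4.5]
label, confidence = top_dialect(example_logits)
print(f"{label}: {confidence:.3f}")
```

With the snippet from the Usage section, `logits[0].tolist()` gives a list of five floats that can be passed to `dialect_probabilities`. A low top probability may be a reason to treat a prediction with caution, although softmax scores are not guaranteed to be calibrated.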