---
library_name: transformers
tags:
- MoroccanArabic
- Darija
- GemMaroc
datasets:
- GemMaroc/TULU-3-50k-darija-english
language:
- ar
- ary
- en
base_model:
- google/gemma-3-27b-it
---

# GemMaroc‑27B

Unlocking **Moroccan Darija** proficiency in a state‑of‑the‑art large language model, trained with a *minimal‑data, green‑AI* recipe that preserves Gemma‑27B’s strong reasoning abilities while adding fluent Darija generation.

---

## Model at a glance

|                      | Details |
| -------------------- | ------- |
| **Model ID**         | `AbderrahmanSkiredj1/GemMaroc-27b-it` |
| **Base model**       | [`google/gemma-3-27b`](https://huggingface.co/google/gemma-3-27b) |
| **Architecture**     | Decoder‑only Transformer (Gemma 3) |
| **Parameters**       | 27 billion |
| **Context length**   | 2 048 tokens |
| **Training regime**  | Supervised fine‑tuning (LoRA → merged) on a 50 K high‑quality Darija/English instruction slice of TULU‑3 (TULU‑50K) |
| **Compute budget**   | 48 GPU·h (8 × H100‑80GB × 6 h) – ≈ 26 kWh / 10 kg CO₂e |
| **License**          | Apache 2.0 |

---

## Why another Darija model?

* **Inclusive AI** More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
* **Quality over quantity** A carefully curated 50 K instruction set surfaces Darija competence without sacrificing cross‑lingual reasoning.
* **Green AI** GemMaroc achieves Atlas‑Chat‑level Darija scores using < 2 % of the energy.

---

## Benchmark summary

| Model            | Darija MMLU | Darija HellaSwag | GSM8K @5   | HellaSwag (EN) |
| ---------------- | ----------- | ---------------- | ---------- | -------------- |
| Atlas‑Chat‑27B   | **61.9 %**  | 48.4 %           | 82.0 %     | 77.8 %         |
| **GemMaroc‑27B** | 61.6 %      | **60.5 %**       | **84.2 %** | **79.3 %**     |

Zero‑shot accuracy; the full table is in the paper.

---

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "AbderrahmanSkiredj1/GemMaroc-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=True,          # sampling must be enabled for temperature to take effect
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

messages = [
    # "What is the 'butterfly effect' theory? Explain it in Darija and give a simple example."
    {"role": "user", "content": "شنو هي نظرية ‘butterfly effect’؟ فسّرها بدارجة ونقّط مثال بسيط."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"][len(prompt):])
```

### Chat template (Gemma 3 format)

The tokenizer ships with a baked‑in Jinja template that starts with a **begin‑of‑sequence** token (`<bos>`), then alternates user/model turns, each wrapped in `<start_of_turn>` … `<end_of_turn>` markers. When you set `add_generation_prompt=True`, the rendered prompt ends after the opening model tag so the model can continue:

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
```

The model keeps generating tokens until it decides to emit `<end_of_turn>`.

```python
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```

No manual token juggling is required: the call above handles the BOS token, turn delimiters, and newline placement automatically.

---

Pre‑quantised checkpoints will be published under the same repo tags (`gemmaroc‑27b‑awq‑int4`, `gemmaroc‑27b‑gguf‑q4_k_m`).

---

## Training recipe (one‑paragraph recap)

1. **Data** Translate a 44 K reasoning slice of TULU 50K into Darija, keeping 20 % English for cross‑lingual robustness.
2. **LoRA SFT** Rank 16, α = 32, 3 epochs, bf16, context 2 048.
3. **Merge & push** Merge LoRA into base weights (`peft.merge_and_unload`), convert to safetensors, upload.
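This recipe maps onto a standard PEFT workflow. The sketch below is illustrative only, not the project's training script: the `target_modules`, adapter path, and output names are assumptions, and the fine‑tuning loop itself (3 epochs, bf16, 2 048‑token context on the Darija/English TULU‑50K slice) is elided.

```python
# Illustrative sketch of steps 2-3; paths, repo names and target_modules are assumptions.
import torch
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "google/gemma-3-27b-it"

# Step 2 - LoRA adapter with the reported hyper-parameters; this config would be
# passed to an SFT trainer (the training loop is omitted here).
lora_cfg = LoraConfig(
    r=16,            # LoRA rank
    lora_alpha=32,   # scaling factor alpha
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)

# Step 3 - merge the trained adapter back into the base weights and export safetensors.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

adapter_dir = "outputs/gemmaroc-27b-lora"  # placeholder path to the trained adapter
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

merged.save_pretrained("gemmaroc-27b-merged", safe_serialization=True)
tokenizer.save_pretrained("gemmaroc-27b-merged")
# merged.push_to_hub("AbderrahmanSkiredj1/GemMaroc-27b-it")  # upload the merged checkpoint
```

Because the adapter is merged before upload, downstream users load GemMaroc like any dense Gemma 3 checkpoint, with no PEFT dependency at inference time.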
---

## Limitations & ethical considerations

* Sentiment analysis and abstractive summarisation still trail the state of the art.
* The tokeniser is unchanged; rare Darija spellings may fragment into many sub‑word tokens.
* The model may inherit societal biases present in its pre‑training data.
* No RLHF / RLAIF safety alignment has been applied yet – add a moderation layer in production.

---

## Citation

If you use GemMaroc in your work, please cite:

```bibtex
@misc{skiredj2025gemmarocunlockingdarijaproficiency,
  title         = {GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
  author        = {Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
  year          = {2025},
  eprint        = {2505.17082},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2505.17082}
}
```