---
library_name: transformers
tags:
- MoroccanArabic
- Darija
- GemMaroc
datasets:
- GemMaroc/TULU-3-50k-darija-english
language:
- ar
- ary
- en
base_model:
- google/gemma-3-27b-it
---

# GemMaroc‑27B

Unlocking **Moroccan Darija** proficiency in a state‑of‑the‑art large language model, trained with a *minimal‑data, green‑AI* recipe that preserves Gemma‑27B’s strong reasoning abilities while adding fluent Darija generation.

---

## Model at a glance

|                      | Details |
| -------------------- | ------- |
| **Model ID**         | `AbderrahmanSkiredj1/GemMaroc-27b-it` |
| **Base model**       | [`google/gemma-3-27b`](https://huggingface.co/google/gemma-3-27b) |
| **Architecture**     | Decoder‑only Transformer (Gemma 3) |
| **Parameters**       | 27 billion |
| **Context length**   | 2 048 tokens |
| **Training regime**  | Supervised fine‑tuning (LoRA → merged) on a 50 K high‑quality Darija/English instruction slice of TULU‑3 (TULU‑50K) |
| **Compute budget**   | 48 GPU·h (8 × H100‑80GB × 6 h) – ≈ 26 kWh / 10 kg CO₂e |
| **License**          | Apache 2.0 |

---

## Why another Darija model?

* **Inclusive AI** More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
* **Quality over quantity** A carefully curated 50 K instruction set surfaces Darija competence without sacrificing cross‑lingual reasoning.
* **Green AI** GemMaroc achieves Atlas‑Chat‑level Darija scores using < 2 % of the energy.

---

## Benchmark summary

| Model            | Darija MMLU | Darija HellaSwag | GSM8K @5   | HellaSwag (EN) |
| ---------------- | ----------- | ---------------- | ---------- | -------------- |
| Atlas‑Chat‑27B   | **61.9 %**  | 48.4 %           | 82.0 %     | 77.8 %         |
| **GemMaroc‑27B** | 61.6 %      | **60.5 %**       | **84.2 %** | **79.3 %**     |

Zero‑shot accuracy; the full table is in the paper.

---

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "AbderrahmanSkiredj1/GemMaroc-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=True,          # sampling must be enabled for temperature to take effect
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

messages = [
    # "What is the 'butterfly effect' theory? Explain it in Darija and give a simple example."
    {"role": "user", "content": "شنو هي نظرية ‘butterfly effect’؟ فسّرها بدارجة ونقّط مثال بسيط."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"][len(prompt):])
```

### Chat template (Gemma 3 format)

The tokenizer ships with a baked‑in Jinja template that starts with a **begin‑of‑sequence** token (`<bos>`), then alternates user/model turns, each wrapped in `<start_of_turn>` … `<end_of_turn>` markers. When you set `add_generation_prompt=True`, the rendered prompt ends after the opening model tag so the model can continue:

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
```

The model keeps generating tokens until it decides to emit `<end_of_turn>`.

```python
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```

No manual token juggling is required: the call above handles the BOS token, turn delimiters, and newline placement automatically.

---

Pre‑quantised checkpoints will be published under the same repo tags (`gemmaroc‑27b‑awq‑int4`, `gemmaroc‑27b‑gguf‑q4_k_m`).

---

## Training recipe (one‑paragraph recap)

1. **Data** Translate a 44 K reasoning slice of TULU 50K into Darija, keeping 20 % English for cross‑lingual robustness.
2. **LoRA SFT** Rank 16, α = 32, 3 epochs, bf16, context 2 048.
3. **Merge & push** Merge LoRA into base weights (`peft.merge_and_unload`), convert to safetensors, upload.
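This recipe maps onto a standard PEFT workflow. The sketch below is illustrative only, not the project's training script: the `target_modules`, adapter path, and output names are assumptions, and the fine‑tuning loop itself (3 epochs, bf16, 2 048‑token context on the Darija/English TULU‑50K slice) is elided.

```python
# Illustrative sketch of steps 2-3; paths, repo names and target_modules are assumptions.
import torch
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "google/gemma-3-27b-it"

# Step 2 - LoRA adapter with the reported hyper-parameters; this config would be
# passed to an SFT trainer (the training loop is omitted here).
lora_cfg = LoraConfig(
    r=16,            # LoRA rank
    lora_alpha=32,   # scaling factor alpha
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)

# Step 3 - merge the trained adapter back into the base weights and export safetensors.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

adapter_dir = "outputs/gemmaroc-27b-lora"  # placeholder path to the trained adapter
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

merged.save_pretrained("gemmaroc-27b-merged", safe_serialization=True)
tokenizer.save_pretrained("gemmaroc-27b-merged")
# merged.push_to_hub("AbderrahmanSkiredj1/GemMaroc-27b-it")  # upload the merged checkpoint
```

Because the adapter is merged before upload, downstream users load GemMaroc like any dense Gemma 3 checkpoint, with no PEFT dependency at inference time.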
---

## Limitations & ethical considerations

* Sentiment analysis and abstractive summarisation still trail the state of the art.
* The tokeniser is unchanged; rare Darija spellings may fragment into many sub‑word tokens.
* The model may inherit societal biases present in its pre‑training data.
* No RLHF / RLAIF safety alignment has been applied yet – add a moderation layer in production.

---

## Citation

If you use GemMaroc in your work, please cite:

```bibtex
@misc{skiredj2025gemmarocunlockingdarijaproficiency,
  title         = {GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
  author        = {Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
  year          = {2025},
  eprint        = {2505.17082},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2505.17082}
}
```