---
language:
- ar
base_model:
- sayed0am/arabic-english-bge-m3
tags:
- sentence-similarity
- sentence-transformers
datasets:
- castorini/mr-tydi
- hsseinmz/arcd
- Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset
- arbml/Arabic_RC
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/662294730e805d4fcb06a892/n3whDLHDmEAhbFgYCbhRj.png)

# 🧠 Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval

[Muffakir_Embedding_V2](https://huggingface.co/mohamed2811/Muffakir_Embedding_V2) is the second version of the [Muffakir_Embedding model](https://huggingface.co/mohamed2811/Muffakir_Embedding).
It shows strong performance on **Arabic retrieval-augmented generation (RAG)** and dense retrieval tasks.
We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval. 🚀

---

## 🔍 Model Overview

* 🧬 **Base model**: [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3)
* 📚 **Fine-tuning dataset**: ~70,000 Arabic sentence pairs from various topics

  * 🏫 **20K** curated from Egyptian legal books
  * 🌍 **50K** collected from Hugging Face datasets (multi-domain)
* 🏋️ **Training epochs**: 3
* 📏 **Embedding dimension**: 1024
* 🔗 **Loss functions** (composition sketched below):

  * [`MultipleNegativesRankingLoss`](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss)
  * [`MatryoshkaLoss`](https://huggingface.co/blog/matryoshka-representations) for multi-resolution embeddings
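
The two losses are typically composed by wrapping the ranking loss in `MatryoshkaLoss`. Here is a minimal sketch of that composition with sentence-transformers; the dimension list is an assumption (spanning 1024 down to 64, as the card implies), not the reported training configuration.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss

model = SentenceTransformer("sayed0am/arabic-english-bge-m3")

# In-batch negatives: every other positive in the batch acts as a
# negative for a given (query, positive) pair.
base_loss = MultipleNegativesRankingLoss(model)

# Apply the same ranking objective at several truncated embedding
# sizes so that prefixes of the vector remain useful on their own.
# The dimension list below is an assumption.
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[1024, 512, 256, 128, 64])
```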

---

## 🌟 Key Features

* 🥇 **Strong performance** on **Arabic RAG** and dense retrieval tasks
* 🎯 **Multi-resolution embeddings** via Matryoshka (dims: `1024 → 64`); see the truncation example after this list
* 🌐 Supports **Arabic** encoding
* 📦 Ready for use in real-world search, Q&A, and AI agent systems
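
To illustrate the multi-resolution embeddings, a small sketch of Matryoshka-style truncation, assuming a recent sentence-transformers release (≥ 2.7) where `SentenceTransformer` accepts a `truncate_dim` argument:

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 dimensions of each embedding; 256 is an
# arbitrary choice between the trained extremes of 1024 and 64.
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2", truncate_dim=256)

emb = model.encode(["ما هي شروط صحة العقد؟"], normalize_embeddings=True)
print(emb.shape)  # (1, 256)
```

Smaller dimensions trade a little retrieval accuracy for faster similarity search and a smaller index.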

---

## โš™๏ธ Training Details

* ๐Ÿงพ **Dataset size**: 70K examples
* ๐Ÿ—‚๏ธ **Topics**: Multi-domain (educational, legal, general knowledge, etc.)
* ๐Ÿ” **Epochs**: 3
* ๐Ÿงช **Batch size**: 8 (gradient accumulation enabled)
* ๐Ÿš€ **Learning rate**: 2e-5
* ๐Ÿงฐ **Framework**: [sentence-transformers](https://www.sbert.net)
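
For orientation, a minimal sketch of what this configuration might look like with the sentence-transformers v3 trainer API. The dataset stub, output directory, and gradient accumulation step count are assumptions (the card only states that accumulation was enabled):

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss

model = SentenceTransformer("sayed0am/arabic-english-bge-m3")

# Toy stand-in for the ~70K (anchor, positive) pairs described above.
train_dataset = Dataset.from_dict({
    "anchor": ["ما هي شروط صحة العقد؟"],
    "positive": ["يشترط التراضي لصحة العقد."],
})

loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[1024, 512, 256, 128, 64],  # assumption
)

args = SentenceTransformerTrainingArguments(
    output_dir="muffakir-v2",        # assumption
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # assumption: card only says "enabled"
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```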

---

## 📀 Model Specs

* 🔢 Embedding size: `1024`
* 🔄 Supports Matryoshka-style dimension truncation
* 🧠 Bi-encoder setup, ideal for fast and scalable retrieval tasks (see the search sketch below)
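
The bi-encoder pattern means the corpus is embedded once up front and only queries are encoded at search time. A small sketch using the library's `semantic_search` utility, with an assumed two-passage corpus for demonstration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Embed the corpus once; reuse the embeddings for every query.
corpus = [
    "يشترط التراضي لصحة العقد.",
    "ينقسم القانون إلى عام وخاص.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_embedding = model.encode(
    "ما هي شروط صحة العقد؟", convert_to_tensor=True, normalize_embeddings=True
)

# Top-k nearest passages by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```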

---

## ๐Ÿ† Leaderboard Performance

* The **Muffakir\_Embedding\_V2** model has achieved notable rankings on the [Arabic RAG Leaderboard](https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard), securing:

* **5th place** in the **Retrieval** category

* These results underscore the model's effectiveness in both retrieving relevant information and accurately ranking it within Arabic Retrieval-Augmented Generation (RAG) systems.

---

## 🧪 Example Usage

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages (Arabic)
query = "ما هي شروط صحة العقد؟"  # "What are the conditions for a valid contract?"
passages = [
    "يشترط التراضي لصحة العقد.",  # "Mutual consent is required for a valid contract."
    "ينقسم القانون إلى عام وخاص.",  # "Law is divided into public and private."
    "العقد شريعة المتعاقدين.",  # "The contract is the law of the contracting parties."
    "تنتهي الولاية القانونية ببلوغ سن الرشد."  # "Legal guardianship ends at the age of majority."
]

# Encode query and passages
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Compute cosine similarities
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]

print(f"๐Ÿ” Best matching passage: {best_passage}")
```


## 📖 Citation

```bibtex
@misc{muffakir2025,
  author = {Mohamed Khaled},
  title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year = {2025},
  howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
}
```


---