# Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval
Muffakir_Embedding_V2 is the second version of the Muffakir_Embedding model. It shows strong performance on Arabic retrieval-augmented generation (RAG) and dense retrieval tasks. We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval.
## Model Overview
- Base model: `sayed0am/arabic-english-bge-m3`
- Fine-tuning dataset: ~70,000 Arabic sentence pairs from various topics
  - 20K curated from Egyptian legal books
  - 50K collected from Hugging Face datasets (multi-domain)
- Training epochs: 3
- Embedding dimension: 1024
- Loss functions: `MultipleNegativesRankingLoss` and `MatryoshkaLoss` (for multi-resolution embeddings)
## Key Features
- Strong performance on Arabic RAG and dense retrieval tasks
- Multi-resolution embeddings via Matryoshka truncation (dims 1024 → 64)
- Supports Arabic (and English) text encoding
- Ready for use in real-world search, Q&A, and AI agent systems
## Training Details
- Dataset size: 70K examples
- Topics: multi-domain (educational, legal, general knowledge, etc.)
- Epochs: 3
- Batch size: 8 (with gradient accumulation)
- Learning rate: 2e-5
- Framework: `sentence-transformers`
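
For reference, here is a minimal sketch of how a fine-tune with this recipe can be set up in `sentence-transformers`, wrapping `MultipleNegativesRankingLoss` in `MatryoshkaLoss`. The training pairs, Matryoshka dimensions, and warmup steps below are illustrative assumptions, not the actual training data or configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the bilingual base model
model = SentenceTransformer("sayed0am/arabic-english-bge-m3")

# Illustrative (query, positive passage) pairs; the real dataset has ~70K pairs
train_examples = [
    InputExample(texts=["ما هي شروط صحة العقد؟", "يشترط التراضي لصحة العقد."]),
    InputExample(texts=["متى تنتهي الولاية القانونية؟", "تنتهي الولاية القانونية ببلوغ سن الرشد."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch-negatives loss, wrapped in MatryoshkaLoss so the embedding
# remains useful when truncated to smaller dimensions
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(
    model, base_loss, matryoshka_dims=[1024, 512, 256, 128, 64]  # assumed dims
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,  # assumed; not stated in the card
    optimizer_params={"lr": 2e-5},
)
```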
## Model Specs
- Embedding size: 1024
- Supports Matryoshka-style dimension truncation (see the sketch below)
- Bi-encoder setup, ideal for fast and scalable retrieval tasks
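
A minimal sketch of Matryoshka-style truncation: slice the embedding to a smaller dimension, then re-normalize before computing cosine similarity. The choice of 256 dims here is an arbitrary illustration; any size down to 64 works:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Full-resolution embeddings (1024 dims), L2-normalized
emb = model.encode(
    ["العقد شريعة المتعاقدين."], convert_to_tensor=True, normalize_embeddings=True
)

# Matryoshka truncation: keep the first k dimensions, then re-normalize
k = 256  # illustrative; any dim from 64 to 1024 is supported
emb_small = torch.nn.functional.normalize(emb[:, :k], p=2, dim=1)
print(emb_small.shape)  # torch.Size([1, 256])
```

Recent `sentence-transformers` releases also accept a `truncate_dim` argument when loading a model, which applies the same truncation inside `encode`.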
## Leaderboard Performance
The Muffakir_Embedding_V2 model has achieved a notable ranking on the Arabic RAG Leaderboard, securing 5th place in the Retrieval category. This result underscores the model's effectiveness at retrieving relevant information and ranking it accurately within Arabic retrieval-augmented generation (RAG) systems.
## Example Usage
```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages
query = "ما هي شروط صحة العقد؟"  # "What are the conditions for a valid contract?"
passages = [
    "يشترط التراضي لصحة العقد.",  # "Mutual consent is required for a valid contract."
    "ينقسم القانون إلى عام وخاص.",  # "Law is divided into public and private."
    "العقد شريعة المتعاقدين.",  # "The contract is the law of the contracting parties."
    "تنتهي الولاية القانونية ببلوغ سن الرشد.",  # "Legal guardianship ends at the age of majority."
]

# Encode query and passages (L2-normalized, so dot product equals cosine similarity)
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Compute cosine similarities
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]
print(f"Best matching passage: {best_passage}")
```
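
For RAG-style retrieval over more than a handful of passages, `sentence_transformers.util.semantic_search` returns the top-k hits directly. A short sketch that continues the example above, reusing its embeddings:

```python
from sentence_transformers import util

# Rank all passages for the query, not just the single best one
hits = util.semantic_search(embedding_query, embedding_passages, top_k=3)[0]
for hit in hits:
    print(f"score={hit['score']:.3f}  {passages[hit['corpus_id']]}")
```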
## Citation

```bibtex
@misc{muffakir2025,
  author = {Mohamed Khaled},
  title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year = {2025},
  howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
}
```