# Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval
Muffakir_Embedding_V2 is the second version of the Muffakir_Embedding model. It shows strong performance on Arabic retrieval-augmented generation (RAG) and dense retrieval tasks. We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval.
## Model Overview
- Base model: `sayed0am/arabic-english-bge-m3`
- Fine-tuning dataset: ~70,000 Arabic sentence pairs from various topics
  - 20K curated from Egyptian legal books
  - 50K collected from Hugging Face datasets (multi-domain)
- Training epochs: 3
- Embedding dimension: 1024
- Loss functions: `MultipleNegativesRankingLoss` and `MatryoshkaLoss` (for multi-resolution embeddings)
## Key Features
- Strong performance on Arabic RAG and dense retrieval tasks
- Multi-resolution embeddings via Matryoshka truncation (dims 1024 → 64)
- Supports Arabic (and English) text encoding
- Ready for use in real-world search, Q&A, and AI agent systems
## Training Details
- Dataset size: 70K examples
- Topics: multi-domain (educational, legal, general knowledge, etc.)
- Epochs: 3
- Batch size: 8 (with gradient accumulation)
- Learning rate: 2e-5
- Framework: `sentence-transformers`
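
For reference, here is a minimal sketch of how a fine-tune with this recipe can be set up in `sentence-transformers`, wrapping `MultipleNegativesRankingLoss` in `MatryoshkaLoss`. The training pairs, Matryoshka dimensions, and warmup steps below are illustrative assumptions, not the actual training data or configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the bilingual base model
model = SentenceTransformer("sayed0am/arabic-english-bge-m3")

# Illustrative (query, positive passage) pairs; the real dataset has ~70K pairs
train_examples = [
    InputExample(texts=["ما هي شروط صحة العقد؟", "يشترط التراضي لصحة العقد."]),
    InputExample(texts=["متى تنتهي الولاية القانونية؟", "تنتهي الولاية القانونية ببلوغ سن الرشد."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch-negatives loss, wrapped in MatryoshkaLoss so the embedding
# remains useful when truncated to smaller dimensions
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(
    model, base_loss, matryoshka_dims=[1024, 512, 256, 128, 64]  # assumed dims
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,  # assumed; not stated in the card
    optimizer_params={"lr": 2e-5},
)
```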
## Model Specs
- Embedding size: 1024
- Supports Matryoshka-style dimension truncation (see the sketch below)
- Bi-encoder setup, ideal for fast and scalable retrieval tasks
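
A minimal sketch of Matryoshka-style truncation: slice the embedding to a smaller dimension, then re-normalize before computing cosine similarity. The choice of 256 dims here is an arbitrary illustration; any size down to 64 works:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Full-resolution embeddings (1024 dims), L2-normalized
emb = model.encode(
    ["العقد شريعة المتعاقدين."], convert_to_tensor=True, normalize_embeddings=True
)

# Matryoshka truncation: keep the first k dimensions, then re-normalize
k = 256  # illustrative; any dim from 64 to 1024 is supported
emb_small = torch.nn.functional.normalize(emb[:, :k], p=2, dim=1)
print(emb_small.shape)  # torch.Size([1, 256])
```

Recent `sentence-transformers` releases also accept a `truncate_dim` argument when loading a model, which applies the same truncation inside `encode`.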
## Leaderboard Performance
The Muffakir_Embedding_V2 model has achieved a notable ranking on the Arabic RAG Leaderboard, securing 5th place in the Retrieval category. This result underscores the model's effectiveness at retrieving relevant information and ranking it accurately within Arabic retrieval-augmented generation (RAG) systems.
## Example Usage
```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages
query = "ما هي شروط صحة العقد؟"  # "What are the conditions for a valid contract?"
passages = [
    "يشترط التراضي لصحة العقد.",  # "Mutual consent is required for a valid contract."
    "ينقسم القانون إلى عام وخاص.",  # "Law is divided into public and private."
    "العقد شريعة المتعاقدين.",  # "The contract is the law of the contracting parties."
    "تنتهي الولاية القانونية ببلوغ سن الرشد.",  # "Legal guardianship ends at the age of majority."
]

# Encode query and passages (L2-normalized, so dot product equals cosine similarity)
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Compute cosine similarities
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]
print(f"Best matching passage: {best_passage}")
```
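
For RAG-style retrieval over more than a handful of passages, `sentence_transformers.util.semantic_search` returns the top-k hits directly. A short sketch that continues the example above, reusing its embeddings:

```python
from sentence_transformers import util

# Rank all passages for the query, not just the single best one
hits = util.semantic_search(embedding_query, embedding_passages, top_k=3)[0]
for hit in hits:
    print(f"score={hit['score']:.3f}  {passages[hit['corpus_id']]}")
```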
## Citation

```bibtex
@misc{muffakir2025,
  author = {Mohamed Khaled},
  title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year = {2025},
  howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
}
```