
🧠 Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval

Muffakir_Embedding_V2 is the second version of the Muffakir_Embedding model. It shows strong performance on Arabic retrieval-augmented generation (RAG) and dense retrieval tasks. We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval. 🚀


๐Ÿ” Model Overview

  • 🧬 Base model: sayed0am/arabic-english-bge-m3

  • 📚 Fine-tuning dataset: ~70,000 Arabic sentence pairs from various topics

    • 🏫 20K curated from Egyptian legal books
    • 🌍 50K collected from Hugging Face datasets (multi-domain)
  • 🏋️ Training epochs: 3

  • 📏 Embedding dimension: 1024

  • 🔗 Loss functions: Matryoshka-style multi-resolution objective over dimensions 1024 → 64 (see Key Features below)


🌟 Key Features

  • 🥇 Strong performance on Arabic RAG and dense retrieval tasks
  • 🎯 Multi-resolution embeddings via Matryoshka (dims: 1024 → 64); see the truncation sketch after this list
  • 🌍 Supports Arabic text encoding
  • 📦 Ready for use in real-world search, Q&A, and AI agent systems
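
Because the Matryoshka objective packs the most important information into the leading dimensions, embeddings can be truncated well below 1024 dimensions to cut storage and search cost. The following is a minimal sketch of truncated encoding: the truncate_dim argument is assumed to be available (it exists in recent sentence-transformers releases), and the 256-dimension setting is only illustrative.

from sentence_transformers import SentenceTransformer

# Load the model with Matryoshka-style truncation to 256 dimensions.
# Alternatively, encode at full size, slice each vector to its first k
# dimensions, and re-normalize it manually.
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2", truncate_dim=256)

sentences = ["العقد شريعة المتعاقدين.", "ينقسم القانون إلى عام وخاص."]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 256)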

โš™๏ธ Training Details

  • 🧾 Dataset size: 70K examples
  • 🗂️ Topics: multi-domain (educational, legal, general knowledge, etc.)
  • 🔁 Epochs: 3
  • 🧪 Batch size: 8 (gradient accumulation enabled)
  • 🚀 Learning rate: 2e-5
  • 🧰 Framework: sentence-transformers (a hedged training sketch follows this list)
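
For reference, below is a minimal sketch of a sentence-transformers fine-tuning setup consistent with the hyperparameters above. The exact loss configuration is not spelled out in this card, so the MultipleNegativesRankingLoss wrapped in MatryoshkaLoss shown here is an assumption based on the Matryoshka dimensions mentioned earlier, and the toy train_pairs list merely stands in for the 70K real sentence pairs.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base model named in the overview
model = SentenceTransformer("sayed0am/arabic-english-bge-m3")

# Toy (query, positive passage) pairs standing in for the 70K-example training set
train_pairs = [
    ("ما هي شروط صحة العقد؟", "يشترط التراضي لصحة العقد."),
    ("ما هو مبدأ القوة الملزمة للعقد؟", "العقد شريعة المتعاقدين."),
]
train_examples = [InputExample(texts=[q, p]) for q, p in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# Assumed objective: in-batch negatives ranking loss wrapped in a Matryoshka loss
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[1024, 512, 256, 128, 64])

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    optimizer_params={"lr": 2e-5},
    warmup_steps=100,  # illustrative value, not stated in the card
)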

📀 Model Specs

  • 🔢 Embedding size: 1024
  • 🔄 Supports Matryoshka-style dimension truncation
  • 🧠 Bi-encoder setup, ideal for fast and scalable retrieval tasks (see the retrieval sketch after this list)
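
Because queries and passages are encoded independently, a corpus can be embedded once and then searched with plain vector similarity. The following is a minimal retrieval sketch using sentence_transformers.util.semantic_search; the corpus simply reuses passages from the usage example further down.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Embed the passage collection once and reuse it for every query
corpus = [
    "يشترط التراضي لصحة العقد.",
    "ينقسم القانون إلى عام وخاص.",
    "العقد شريعة المتعاقدين.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# Encode the query and retrieve the top-2 most similar passages
query_embedding = model.encode("ما هي شروط صحة العقد؟", convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))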


๐Ÿ† Leaderboard Performance

The Muffakir_Embedding_V2 model has achieved a notable ranking on the Arabic RAG Leaderboard, securing:

  • 5th place in the Retrieval category

These results underscore the model's effectiveness in retrieving relevant information and ranking it accurately within Arabic Retrieval-Augmented Generation (RAG) systems.


🧪 Example Usage

from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages
query = "ما هي شروط صحة العقد؟"  # "What are the conditions for a valid contract?"
passages = [
    "يشترط التراضي لصحة العقد.",  # "Mutual consent is required for the contract to be valid."
    "ينقسم القانون إلى عام وخاص.",  # "Law is divided into public and private."
    "العقد شريعة المتعاقدين.",  # "The contract is the law of the contracting parties."
    "تنتهي الولاية القانونية ببلوغ سن الرشد."  # "Legal guardianship ends at the age of majority."
]

# Encode query and passages
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Compute cosine similarities
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]

print(f"🔍 Best matching passage: {best_passage}")

📖 Citation

@misc{muffakir2025,
  author = {Mohamed Khaled},
  title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year = {2025},
  howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
}
