---
language:
- ar
base_model:
- sayed0am/arabic-english-bge-m3
tags:
- sentence-similarity
- sentence-transformers
datasets:
- castorini/mr-tydi
- hsseinmz/arcd
- Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset
- arbml/Arabic_RC
---

# Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval
[Muffakir_Embedding_V2](https://huggingface.co/mohamed2811/Muffakir_Embedding_V2) is the second version of the [Muffakir_Embedding model](https://huggingface.co/mohamed2811/Muffakir_Embedding).
It shows strong performance in **Arabic retrieval-augmented generation (RAG)** and dense retrieval tasks.
We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval.
---
## Model Overview
* **Base model**: [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3)
* **Fine-tuning dataset**: ~70,000 Arabic sentence pairs covering a variety of topics
  * **20K** curated from Egyptian legal books
  * **50K** collected from Hugging Face datasets (multi-domain)
* **Training epochs**: 3
* **Embedding dimension**: 1024
* **Loss functions** (see the training sketch below):
  * [`MultipleNegativesRankingLoss`](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss)
  * [`MatryoshkaLoss`](https://huggingface.co/blog/matryoshka-representations) for multi-resolution embeddings
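
A minimal sketch of how these two losses are typically combined in [sentence-transformers](https://www.sbert.net); the intermediate Matryoshka dimensions here are illustrative assumptions, since the card only states the `1024 → 64` range:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss

model = SentenceTransformer("sayed0am/arabic-english-bge-m3")

# In-batch negatives: every other positive in the batch serves as a
# negative for each (query, passage) pair
base_loss = MultipleNegativesRankingLoss(model)

# MatryoshkaLoss re-applies the base loss to truncated prefixes of the
# embedding, so the leading dimensions remain useful on their own
# (dims other than 1024 and 64 are assumptions)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[1024, 512, 256, 128, 64])
```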
---
## Key Features
* **Strong performance** in **Arabic RAG** and dense retrieval tasks
* **Multi-resolution embeddings** via Matryoshka truncation (dims `1024 → 64`; see the sketch below)
* Strong support for **Arabic** text, alongside the English capability inherited from the base model
* Ready for use in real-world search, Q&A, and AI agent systems
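
As a sketch of the multi-resolution property: recent sentence-transformers releases (v2.7+) accept a `truncate_dim` argument that keeps only the leading Matryoshka dimensions at encode time:

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 embedding dimensions (Matryoshka truncation)
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2", truncate_dim=256)

# "The contract is the law of the contracting parties."
embeddings = model.encode(["العقد شريعة المتعاقدين."], normalize_embeddings=True)
print(embeddings.shape)  # (1, 256)
```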
---
## Training Details
* **Dataset size**: 70K examples
* **Topics**: Multi-domain (educational, legal, general knowledge, etc.)
* **Epochs**: 3
* **Batch size**: 8 (gradient accumulation enabled)
* **Learning rate**: 2e-5
* **Framework**: [sentence-transformers](https://www.sbert.net) (configuration sketch below)
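
A rough sketch of how these hyperparameters might map onto the sentence-transformers v3 trainer API; the `output_dir` and the exact accumulation step count are assumptions, as the card only says accumulation was enabled:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="muffakir-embedding-v2",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,       # assumed value; only "enabled" is stated
    learning_rate=2e-5,
)
```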
---
## Model Specs
* Embedding size: `1024`
* Supports Matryoshka-style dimension truncation
* Bi-encoder setup, ideal for fast and scalable retrieval tasks (see the retrieval sketch below)
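
For corpus-scale retrieval with the bi-encoder, a small sketch using `sentence_transformers.util.semantic_search` (the corpus snippets reuse the legal examples from the usage section further down):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

corpus = [
    "يشترط التراضي لصحة العقد.",  # "Mutual consent is required for a valid contract."
    "العقد شريعة المتعاقدين.",    # "The contract is the law of the contracting parties."
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# "What are the conditions for a valid contract?"
query_embedding = model.encode("ما هي شروط صحة العقد؟", convert_to_tensor=True, normalize_embeddings=True)

# One ranked hit list per query; each hit carries "corpus_id" and "score"
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```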
---
## Leaderboard Performance
* **Muffakir_Embedding_V2** has achieved a notable ranking on the [Arabic RAG Leaderboard](https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard):
  * **5th place** in the **Retrieval** category
* This result underscores the model's effectiveness at retrieving relevant information and ranking it accurately within Arabic retrieval-augmented generation (RAG) systems.
---
## Example Usage
```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages
query = "ما هي شروط صحة العقد؟"  # "What are the conditions for a valid contract?"
passages = [
    "يشترط التراضي لصحة العقد.",                # "Mutual consent is required for a valid contract."
    "ينقسم القانون إلى عام وخاص.",              # "Law is divided into public and private."
    "العقد شريعة المتعاقدين.",                  # "The contract is the law of the contracting parties."
    "تنتهي الولاية القانونية ببلوغ سن الرشد.",  # "Legal guardianship ends at the age of majority."
]

# Encode query and passages (L2-normalized, so dot product equals cosine similarity)
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Compute cosine similarities
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]
print(f"Best matching passage: {best_passage}")
```
## Citation
```bibtex
@misc{muffakir2025,
  author       = {Mohamed Khaled},
  title        = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
}
```