---
language:
- ar
base_model:
- sayed0am/arabic-english-bge-m3
tags:
- sentence-similarity
- sentence-transformers
datasets:
- castorini/mr-tydi
- hsseinmz/arcd
- Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset
- arbml/Arabic_RC
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/662294730e805d4fcb06a892/n3whDLHDmEAhbFgYCbhRj.png)

# 🧠 Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval

[Muffakir_Embedding_V2](https://huggingface.co/mohamed2811/Muffakir_Embedding_V2) is the second version of the [Muffakir_Embedding model](https://huggingface.co/mohamed2811/Muffakir_Embedding).
It shows strong performance on **Arabic retrieval-augmented generation (RAG)** and dense retrieval tasks.
We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval. 🚀

---

## 🔍 Model Overview

* 🧬 **Base model**: [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3)
* 📚 **Fine-tuning dataset**: ~70,000 Arabic sentence pairs from various topics

  * 🏫 **20K** curated from Egyptian legal books
  * 🌍 **50K** collected from Hugging Face datasets (multi-domain)
* 🏋️ **Training epochs**: 3
* 📏 **Embedding dimension**: 1024
* 🔗 **Loss functions** (composition sketched below):

  * [`MultipleNegativesRankingLoss`](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss)
  * [`MatryoshkaLoss`](https://huggingface.co/blog/matryoshka-representations) for multi-resolution embeddings
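
The two losses are typically composed by wrapping the ranking loss in `MatryoshkaLoss`. Here is a minimal sketch of that composition with sentence-transformers; the dimension list is an assumption (spanning 1024 down to 64, as the card implies), not the reported training configuration.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss

model = SentenceTransformer("sayed0am/arabic-english-bge-m3")

# In-batch negatives: every other positive in the batch acts as a
# negative for a given (query, positive) pair.
base_loss = MultipleNegativesRankingLoss(model)

# Apply the same ranking objective at several truncated embedding
# sizes so that prefixes of the vector remain useful on their own.
# The dimension list below is an assumption.
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[1024, 512, 256, 128, 64])
```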

---

## 🌟 Key Features

* 🥇 **Strong performance** on **Arabic RAG** and dense retrieval tasks
* 🎯 **Multi-resolution embeddings** via Matryoshka (dims: `1024 → 64`); see the truncation example after this list
* 🌐 Supports **Arabic** encoding
* 📦 Ready for use in real-world search, Q&A, and AI agent systems
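
To illustrate the multi-resolution embeddings, a small sketch of Matryoshka-style truncation, assuming a recent sentence-transformers release (≥ 2.7) where `SentenceTransformer` accepts a `truncate_dim` argument:

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 dimensions of each embedding; 256 is an
# arbitrary choice between the trained extremes of 1024 and 64.
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2", truncate_dim=256)

emb = model.encode(["ما هي شروط صحة العقد؟"], normalize_embeddings=True)
print(emb.shape)  # (1, 256)
```

Smaller dimensions trade a little retrieval accuracy for faster similarity search and a smaller index.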

---

## โš™๏ธ Training Details

* ๐Ÿงพ **Dataset size**: 70K examples
* ๐Ÿ—‚๏ธ **Topics**: Multi-domain (educational, legal, general knowledge, etc.)
* ๐Ÿ” **Epochs**: 3
* ๐Ÿงช **Batch size**: 8 (gradient accumulation enabled)
* ๐Ÿš€ **Learning rate**: 2e-5
* ๐Ÿงฐ **Framework**: [sentence-transformers](https://www.sbert.net)
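
For orientation, a minimal sketch of what this configuration might look like with the sentence-transformers v3 trainer API. The dataset stub, output directory, and gradient accumulation step count are assumptions (the card only states that accumulation was enabled):

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss

model = SentenceTransformer("sayed0am/arabic-english-bge-m3")

# Toy stand-in for the ~70K (anchor, positive) pairs described above.
train_dataset = Dataset.from_dict({
    "anchor": ["ما هي شروط صحة العقد؟"],
    "positive": ["يشترط التراضي لصحة العقد."],
})

loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[1024, 512, 256, 128, 64],  # assumption
)

args = SentenceTransformerTrainingArguments(
    output_dir="muffakir-v2",        # assumption
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # assumption: card only says "enabled"
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```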

---

## 📀 Model Specs

* 🔢 Embedding size: `1024`
* 🔄 Supports Matryoshka-style dimension truncation
* 🧠 Bi-encoder setup, ideal for fast and scalable retrieval tasks (see the search sketch below)
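
The bi-encoder pattern means the corpus is embedded once up front and only queries are encoded at search time. A small sketch using the library's `semantic_search` utility, with an assumed two-passage corpus for demonstration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Embed the corpus once; reuse the embeddings for every query.
corpus = [
    "يشترط التراضي لصحة العقد.",
    "ينقسم القانون إلى عام وخاص.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_embedding = model.encode(
    "ما هي شروط صحة العقد؟", convert_to_tensor=True, normalize_embeddings=True
)

# Top-k nearest passages by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```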

---

## ๐Ÿ† Leaderboard Performance

* The **Muffakir\_Embedding\_V2** model has achieved notable rankings on the [Arabic RAG Leaderboard](https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard), securing:

* **5th place** in the **Retrieval** category

* These results underscore the model's effectiveness in both retrieving relevant information and accurately ranking it within Arabic Retrieval-Augmented Generation (RAG) systems.

---

## 🧪 Example Usage

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages (Arabic)
query = "ما هي شروط صحة العقد؟"  # "What are the conditions for a valid contract?"
passages = [
    "يشترط التراضي لصحة العقد.",  # "Mutual consent is required for a valid contract."
    "ينقسم القانون إلى عام وخاص.",  # "Law is divided into public and private."
    "العقد شريعة المتعاقدين.",  # "The contract is the law of the contracting parties."
    "تنتهي الولاية القانونية ببلوغ سن الرشد."  # "Legal guardianship ends at the age of majority."
]

# Encode query and passages
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Compute cosine similarities
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]

print(f"๐Ÿ” Best matching passage: {best_passage}")
```


## 📖 Citation

```bibtex
@misc{muffakir2025,
  author = {Mohamed Khaled},
  title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year = {2025},
  howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
}
```


---