---
language:
- ar
base_model:
- sayed0am/arabic-english-bge-m3
tags:
- sentence-similarity
- sentence-transformers
datasets:
- castorini/mr-tydi
- hsseinmz/arcd
- Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset
- arbml/Arabic_RC
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/662294730e805d4fcb06a892/n3whDLHDmEAhbFgYCbhRj.png)
# 🧠 Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval
[Muffakir_Embedding_V2](https://huggingface.co/mohamed2811/Muffakir_Embedding_V2) is the second version of the [Muffakir_Embedding model](https://huggingface.co/mohamed2811/Muffakir_Embedding).
It shows strong performance in **Arabic retrieval-augmented generation (RAG)** and dense retrieval tasks.
We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval. 🚀
---
## 🔍 Model Overview
* 🧬 **Base model**: [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3)
* 📚 **Fine-tuning dataset**: ~70,000 Arabic sentence pairs from various topics
  * 🏫 **20K** curated from Egyptian legal books
  * 🌐 **50K** collected from Hugging Face datasets (multi-domain)
* 🏋️ **Training epochs**: 3
* 📐 **Embedding dimension**: 1024
* 🔗 **Loss functions**:
  * [`MultipleNegativesRankingLoss`](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss)
  * [`MatryoshkaLoss`](https://huggingface.co/blog/matryoshka-representations) for multi-resolution embeddings
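
To make the two loss functions more concrete, here is a minimal pure-Python sketch of how an in-batch-negatives ranking loss can be summed over several truncated embedding sizes. It is illustrative only: the toy 4-dimensional vectors, the `dims=(2, 4)` choice, the `scale=20.0` value, the equal weighting, and the helper names are all assumptions, not the settings or code used to train this model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ranking_loss(queries, passages, scale=20.0):
    """In-batch negatives: passage i is the positive for query i,
    and every other passage in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * cosine(q, p) for p in passages]
        log_sum_exp = math.log(sum(math.exp(s) for s in scores))
        total += log_sum_exp - scores[i]  # cross-entropy with target index i
    return total / len(queries)

def matryoshka_loss(queries, passages, dims=(2, 4)):
    """Sum the ranking loss over prefixes of each embedding (equal weights)."""
    total = 0.0
    for d in dims:
        q_trunc = [q[:d] for q in queries]
        p_trunc = [p[:d] for p in passages]
        total += ranking_loss(q_trunc, p_trunc)
    return total

# Toy batch (hypothetical values): query i is meant to match passage i.
queries = [[1.0, 0.1, 0.0, 0.2], [0.0, 1.0, 0.3, 0.1]]
passages = [[0.9, 0.2, 0.1, 0.1], [0.1, 0.8, 0.4, 0.0]]

aligned = matryoshka_loss(queries, passages)
swapped = matryoshka_loss(queries, passages[::-1])
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

The loss is small when each query is closest to its own passage at every prefix length, and grows when the pairs are swapped, which is the signal that encourages the leading dimensions to remain useful on their own.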
---
## 🌟 Key Features
* 🥇 **Strong performance** in **Arabic RAG** and dense retrieval tasks
* 🎯 **Multi-resolution embeddings** via Matryoshka (dims: `1024 → 64`)
* 🌍 Supports **Arabic** text encoding
* 📦 Ready for use in real-world search, Q&A, and AI agent systems
---
## ⚙️ Training Details
* 🧾 **Dataset size**: 70K examples
* 🗂️ **Topics**: Multi-domain (educational, legal, general knowledge, etc.)
* 🔁 **Epochs**: 3
* 🧪 **Batch size**: 8 (gradient accumulation enabled)
* 🚀 **Learning rate**: 2e-5
* 🧰 **Framework**: [sentence-transformers](https://www.sbert.net)
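
As a rough sanity check on the schedule above, the number of forward batches follows directly from the dataset size, batch size, and epoch count. The accumulation factor is not stated in this card, so `accumulation_steps = 4` below is purely a hypothetical placeholder, and the step counts are approximate.

```python
import math

dataset_size = 70_000     # from the card (approximate)
batch_size = 8            # from the card
epochs = 3                # from the card
accumulation_steps = 4    # ASSUMPTION: the card does not state this value

# Forward/backward passes
batches_per_epoch = math.ceil(dataset_size / batch_size)
total_batches = batches_per_epoch * epochs

# Optimizer updates happen once per `accumulation_steps` batches
effective_batch_size = batch_size * accumulation_steps
optimizer_steps = math.ceil(batches_per_epoch / accumulation_steps) * epochs

print(f"batches per epoch:    {batches_per_epoch}")
print(f"total batches:        {total_batches}")
print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps:      {optimizer_steps}")
```

One caveat worth keeping in mind: with an in-batch-negatives loss, gradient accumulation enlarges the effective batch seen by the optimizer, but each query is still contrasted only against the passages in its own forward batch.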
---
## 📀 Model Specs
* 🔢 Embedding size: `1024`
* 🔄 Supports Matryoshka-style dimension truncation
* 🧠 Bi-encoder setup, ideal for fast and scalable retrieval tasks
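
Dimension truncation amounts to keeping the first `k` components of an embedding and re-normalizing before computing cosine similarity. The sketch below uses toy 8-dimensional vectors and a hypothetical `truncate` helper; with the real model, the same idea would apply to prefixes of the 1024-dimensional output.

```python
import math

def truncate(vec, k):
    """Keep the first k components and rescale the result to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))

# Toy embeddings (hypothetical values, not produced by the model)
query = [0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1]
relevant = [0.5, 0.6, 0.3, 0.4, 0.1, 0.3, 0.2, 0.0]
unrelated = [-0.4, 0.1, -0.5, 0.2, 0.6, -0.1, 0.3, -0.2]

# Compare similarities at full size and at two shorter prefixes
for k in (8, 4, 2):
    q = truncate(query, k)
    s_rel = dot(q, truncate(relevant, k))
    s_unrel = dot(q, truncate(unrelated, k))
    print(f"dims={k}: relevant={s_rel:.3f}, unrelated={s_unrel:.3f}")
```

Shorter prefixes reduce index size and speed up search, usually at some cost in ranking quality. Recent releases of sentence-transformers also accept a `truncate_dim` argument when loading a model, so check the installed version before relying on manual slicing.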
---
## 🏆 Leaderboard Performance
* The **Muffakir_Embedding_V2** model is ranked on the [Arabic RAG Leaderboard](https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard):
  * **5th place** in the **Retrieval** category
* This result reflects the model's effectiveness at retrieving relevant passages for Arabic retrieval-augmented generation (RAG) systems.
---
## 🧪 Example Usage
```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages (English glosses in the comments)
query = "ما هي شروط صحة العقد؟"  # What are the conditions for a valid contract?
passages = [
    "يشترط التراضي لصحة العقد.",  # Mutual consent is required for a contract to be valid.
    "ينقسم القانون إلى عام وخاص.",  # Law is divided into public and private.
    "العقد شريعة المتعاقدين.",  # The contract is the law of the contracting parties.
    "تنتهي الولاية القانونية ببلوغ سن الرشد.",  # Legal guardianship ends at the age of majority.
]

# Encode query and passages as L2-normalized tensors
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# With unit-length vectors, the dot product equals cosine similarity
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]
print(f"🔍 Best matching passage: {best_passage}")
```
---
## 📚 Citation
```bibtex
@misc{muffakir2025,
  author = {Mohamed Khaled},
  title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year = {2025},
  howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
}
```
---