---
language:
- ar
base_model:
- sayed0am/arabic-english-bge-m3
tags:
- sentence-similarity
- sentence-transformers
datasets:
- castorini/mr-tydi
- hsseinmz/arcd
- Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset
- arbml/Arabic_RC
---

# Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval

[Muffakir](https://huggingface.co/mohamed2811/Muffakir_Embedding_V2) is the second version of the [Muffakir_Embedding model](https://huggingface.co/mohamed2811/Muffakir_Embedding).
It is fine-tuned for **Arabic retrieval-augmented generation (RAG)** and dense retrieval tasks.

We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval.

---

## Model Overview

* **Base model**: [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3)
* **Fine-tuning dataset**: ~70,000 Arabic sentence pairs from various topics
  * **20K** curated from Egyptian legal books
  * **50K** collected from Hugging Face datasets (multi-domain)
* **Training epochs**: 3
* **Embedding dimension**: 1024
* **Loss functions**:
  * [`MultipleNegativesRankingLoss`](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss)
  * [`MatryoshkaLoss`](https://huggingface.co/blog/matryoshka-representations) for multi-resolution embeddings
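
For intuition, the two losses listed above can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: the function names `mnr_loss` and `matryoshka_loss` are hypothetical, the `scale=20.0` default and equal per-dimension weights are assumptions, and the intermediate dimensions (512, 256, 128) are assumed because only the 1024 and 64 endpoints are stated here. Each query treats its own passage as the positive and every other passage in the batch as a negative; the Matryoshka variant repeats that loss on prefixes of the embedding so that truncated vectors stay useful.

```python
import numpy as np


def normalize(x):
    """L2-normalize each row so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mnr_loss(queries, positives, scale=20.0):
    """In-batch-negatives ranking loss: for row i, column i is the correct match."""
    scores = scale * normalize(queries) @ normalize(positives).T  # (batch, batch)
    # Numerically stable log-softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def matryoshka_loss(queries, positives, dims=(1024, 512, 256, 128, 64)):
    """Sum the base loss over embedding prefixes (equal weights assumed)."""
    return sum(mnr_loss(queries[:, :d], positives[:, :d]) for d in dims)


rng = np.random.default_rng(0)
q = rng.normal(size=(8, 1024))          # 8 toy "query" embeddings
p = q + 0.1 * rng.normal(size=q.shape)  # matching "passages": noisy copies
print(mnr_loss(q, p), matryoshka_loss(q, p))
```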

---

## Key Features

* **Strong performance** in **Arabic RAG** and dense retrieval tasks
* **Multi-resolution embeddings** via Matryoshka (dims: `1024 → 64`)
* Supports **Arabic** text encoding
* Ready for use in real-world search, Q&A, and AI agent systems

---

## Training Details

* **Dataset size**: 70K examples
* **Topics**: Multi-domain (educational, legal, general knowledge, etc.)
* **Epochs**: 3
* **Batch size**: 8 (gradient accumulation enabled)
* **Learning rate**: 2e-5
* **Framework**: [sentence-transformers](https://www.sbert.net)
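
As a rough illustration of how these settings interact, the sketch below computes the effective batch size and the number of optimizer steps. The dataset size, batch size, and epoch count come from the list above; the accumulation factor is not stated, so `accumulation_steps=4` is a hypothetical value, and `training_schedule` is an illustrative helper rather than part of any library. With in-batch negatives, accumulation typically enlarges the batch seen by the optimizer but not the number of negatives per query, which still come from each batch of 8.

```python
import math


def training_schedule(num_examples, batch_size, accumulation_steps, epochs):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    micro_batches = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(micro_batches / accumulation_steps)
    return batch_size * accumulation_steps, steps_per_epoch, steps_per_epoch * epochs


# 70K examples, batch size 8, and 3 epochs come from the list above;
# accumulation_steps=4 is a hypothetical value chosen only for illustration.
effective_batch, per_epoch, total = training_schedule(70_000, 8, 4, 3)
print(effective_batch, per_epoch, total)
```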

---

## Model Specs

* Embedding size: `1024`
* Supports Matryoshka-style dimension truncation
* Bi-encoder setup, ideal for fast and scalable retrieval tasks
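
Matryoshka-style truncation amounts to keeping the leading components of each vector and re-normalizing before computing cosine similarity. The sketch below shows this on random stand-in vectors; `truncate_embeddings` is a hypothetical helper and 256 is an arbitrary target size. Recent sentence-transformers releases also accept a `truncate_dim` argument when loading a model, which may be more convenient if your installed version supports it.

```python
import numpy as np


def truncate_embeddings(embeddings, dim):
    """Keep the first `dim` components of each row and L2-normalize the result."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)


# Stand-in for model.encode(...) output: 4 random 1024-dimensional vectors.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))
small = truncate_embeddings(full, 256)
print(small.shape)
```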

---

## Leaderboard Performance

* The **Muffakir_Embedding_V2** model is ranked on the [Arabic RAG Leaderboard](https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard):
  * **5th place** in the **Retrieval** category
* This result reflects the model's effectiveness at retrieving relevant passages for Arabic Retrieval-Augmented Generation (RAG) systems.

---

## Example Usage

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query: "What are the conditions for a valid contract?"
query = "ما هي شروط صحة العقد؟"

# Candidate passages, in order:
#   "Mutual consent is required for a contract to be valid."
#   "Law is divided into public and private."
#   "The contract is the law of the contracting parties."
#   "Legal guardianship ends on reaching the age of majority."
passages = [
    "يشترط التراضي لصحة العقد.",
    "ينقسم القانون إلى عام وخاص.",
    "العقد شريعة المتعاقدين.",
    "تنتهي الولاية القانونية ببلوغ سن الرشد.",
]

# Encode query and passages; normalized vectors make dot products equal cosine similarities
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Compute cosine similarities; the result has shape (1, len(passages))
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Get best matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]

print(f"Best matching passage: {best_passage}")
```
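
In a RAG pipeline you usually pass several passages to the generator rather than only the best one. The following sketch ranks passages by cosine similarity and returns the top `k`; it uses toy 3-dimensional vectors in place of real embeddings so it runs without downloading the model, and `top_k` is a hypothetical helper name.

```python
import numpy as np


def top_k(query_vec, passage_vecs, k=2):
    """Return (index, score) pairs for the k highest cosine similarities."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Toy 3-dimensional vectors stand in for real query and passage embeddings.
query_vec = np.array([1.0, 0.0, 0.0])
passage_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [-1.0, 0.0, 0.0],
])
for idx, score in top_k(query_vec, passage_vecs):
    print(idx, round(score, 3))
```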

If you use this model, please cite it as:

```bibtex
@misc{muffakir2025,
  author = {Mohamed Khaled},
  title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
  year = {2025},
  howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
}
```

---