	---
	language:
	- ar
	base_model:
	- sayed0am/arabic-english-bge-m3
	tags:
	- sentence-similarity
	- sentence-transformers
	datasets:
	- castorini/mr-tydi
	- hsseinmz/arcd
	- Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset
	- arbml/Arabic_RC
	---

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/662294730e805d4fcb06a892/n3whDLHDmEAhbFgYCbhRj.png)

	# 🧠 Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval

	[Muffakir](https://huggingface.co/mohamed2811/Muffakir_Embedding_V2) is the second version of the [Muffakir_Embedding model](https://huggingface.co/mohamed2811/Muffakir_Embedding).
	It shows strong performance in Arabic retrieval-augmented generation (RAG) and dense retrieval tasks.
	We plan to release a series of models focused on different topics and domains to further enhance Arabic information retrieval. 🚀

	---

	## 🔍 Model Overview

	* 🧬 Base model: [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3)
	* 📚 Fine-tuning dataset: \~70,000 Arabic sentence pairs from various topics
	  * 🏫 20K curated from Egyptian legal books
	  * 🌐 50K collected from Hugging Face datasets (multi-domain)
	* 🏋️ Training epochs: 3
	* 📏 Embedding dimension: 1024
	* 🔗 Loss functions:
	  * [`MultipleNegativesRankingLoss`](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss)
	  * [`MatryoshkaLoss`](https://huggingface.co/blog/matryoshka-representations) for multi-resolution embeddings
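
How the two losses fit together can be sketched in plain Python. This is a conceptual illustration, not the training code used for Muffakir: the helper names (`cosine`, `mnr_loss`, `matryoshka_loss`), the toy 4-dimensional vectors, and the dimension list `[4, 2]` are all assumptions made for the example. The `scale=20.0` value mirrors the sentence-transformers default for `MultipleNegativesRankingLoss`; the model card does not say which value was used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(queries, passages, scale=20.0):
    """In-batch-negatives ranking loss.

    passages[i] is the positive for queries[i]; every other passage
    in the batch acts as a negative for that query.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in passages]
        # Cross-entropy with the matching passage as the target class
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)

def matryoshka_loss(queries, passages, dims, weights=None):
    """Weighted sum of the base loss over embedding prefixes of each length in dims."""
    weights = weights or [1.0] * len(dims)
    return sum(
        w * mnr_loss([q[:d] for q in queries], [p[:d] for p in passages])
        for d, w in zip(dims, weights)
    )

# Toy 4-dimensional embeddings (hypothetical values, not model output)
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
passages = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
loss = matryoshka_loss(queries, passages, dims=[4, 2])
```

Because the base loss is also applied to each truncated prefix, the leading components of the embedding are pushed to carry most of the retrieval signal, which is what makes later truncation workable.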

	---

	## 🌟 Key Features

	* 🥇 Strong performance in Arabic RAG and dense retrieval tasks
	* 🎯 Multi-resolution embeddings via Matryoshka (dims: `1024 → 64`)
	* 🌍 Encodes Arabic text
	* 📦 Ready for use in real-world search, Q\&A, and AI agent systems

	---

	## ⚙️ Training Details

	* 🧾 Dataset size: 70K examples
	* 🗂️ Topics: Multi-domain (educational, legal, general knowledge, etc.)
	* 🔁 Epochs: 3
	* 🧪 Batch size: 8 (gradient accumulation enabled)
	* 🚀 Learning rate: 2e-5
	* 🧰 Framework: [sentence-transformers](https://www.sbert.net)
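
As a rough sense of scale, the figures above translate into optimizer steps as follows. The number of gradient accumulation steps is not stated in this model card, so `accumulation_steps = 4` below is purely an assumed value for illustration, and a real trainer may round step counts slightly differently.

```python
import math

dataset_size = 70_000     # from the model card
micro_batch_size = 8      # from the model card
epochs = 3                # from the model card
accumulation_steps = 4    # ASSUMPTION: not stated in the model card

# Examples contributing to each optimizer update
effective_batch_size = micro_batch_size * accumulation_steps

# Forward/backward passes per epoch, then optimizer updates per epoch
micro_batches_per_epoch = math.ceil(dataset_size / micro_batch_size)
optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / accumulation_steps)
total_optimizer_steps = optimizer_steps_per_epoch * epochs

print(effective_batch_size, micro_batches_per_epoch, total_optimizer_steps)
```

With an in-batch-negatives loss, the loss is typically computed per micro-batch, so each query would still be contrasted against only 7 other passages (8 minus its own positive) regardless of the accumulation setting.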

	---

	## 📀 Model Specs

	* 🔢 Embedding size: `1024`
	* 🔄 Supports Matryoshka-style dimension truncation
	* 🧠 Bi-encoder setup, ideal for fast and scalable retrieval tasks
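
Matryoshka-style truncation amounts to keeping a prefix of the vector and re-normalizing it before computing similarity. The sketch below uses plain Python lists so it runs without downloading the model; the function names and the 8-dimensional stand-in vectors are hypothetical and only illustrate the idea.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in prefix]

def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 8-dimensional stand-ins for 1024-dimensional model output
query = [0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0.0, 0.1]
passage = [0.4, 0.5, 0.2, 0.3, 0.0, 0.2, 0.1, 0.0]

q4 = truncate_and_normalize(query, 4)
p4 = truncate_and_normalize(passage, 4)
score_4d = dot(q4, p4)  # cosine similarity in the truncated space
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may be more convenient than truncating by hand; smaller dimensions reduce index size and search cost, usually at some loss in retrieval quality.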

	---

	## 🏆 Leaderboard Performance

	* Muffakir\_Embedding\_V2 placed 5th in the Retrieval category of the [Arabic RAG Leaderboard](https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard).
	* This result reflects the model's effectiveness at retrieving and ranking relevant passages in Arabic retrieval-augmented generation (RAG) systems.

	---

	## 🧪 Example Usage

	```python
	from sentence_transformers import SentenceTransformer
	import torch

	# Load the fine-tuned Muffakir model
	model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

	# Example query and candidate passages
	query = "ما هي شروط صحة العقد؟"
	passages = [
	    "يشترط التراضي لصحة العقد.",
	    "ينقسم القانون إلى عام وخاص.",
	    "العقد شريعة المتعاقدين.",
	    "تنتهي الولاية القانونية ببلوغ سن الرشد.",
	]

	# Encode query and passages
	embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
	embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

	# Embeddings are L2-normalized, so the dot product equals cosine similarity
	cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

	# Get best matching passage
	best_idx = cosine_scores.argmax().item()
	best_passage = passages[best_idx]

	print(f"🔍 Best matching passage: {best_passage}")
	```
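
RAG pipelines usually pass several passages to the generator rather than a single best match. A minimal top-k helper might look like the following; `top_k` is a hypothetical name, and the scores and passage labels are made-up values rather than actual model output.

```python
def top_k(scores, passages, k=3):
    """Return the k highest-scoring (score, passage) pairs, best first."""
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]

# Hypothetical cosine scores, not actual model output
scores = [0.71, 0.18, 0.55, 0.09]
passages = ["passage A", "passage B", "passage C", "passage D"]

results = top_k(scores, passages, k=2)
for score, passage in results:
    print(f"{score:.2f}  {passage}")
```

With the tensors from the example above, `cosine_scores[0].tolist()` would supply the `scores` list.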


	---

	## 📝 Citation

	```bibtex
	@misc{muffakir2025,
	author = {Mohamed Khaled},
	title = {Muffakir: State-of-the-art Arabic-English Bi-Encoder for Dense Retrieval},
	year = {2025},
	howpublished = {\url{https://huggingface.co/mohamed2811/Muffakir_Embedding_V2}},
	}
	```


	---