AraGemma-Embedding-300m
Model Page: AraGemma-Embedding (Hugging Face)
Authors: Google DeepMind (base model), fine-tuned by Omartificial-Intelligence-Space
See also: Arabic Semantic Embedding Models
Example: Simple RAG and other NLP tasks
Model Overview
AraGemma-Embedding-300m is a fine-tuned version of EmbeddingGemma-300M, optimized for Arabic semantic understanding.
The model was fine-tuned on 1 million Arabic triplets (anchor, positive, negative) with Matryoshka Representation Learning (MRL) to improve semantic similarity, clustering, classification, and retrieval for Arabic text (a sketch of this training objective appears below).
It builds on Google’s Gemma 3 research, making it lightweight, efficient, and deployable on-device (mobile, laptops, desktops) while delivering strong Arabic semantic embedding performance.
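The exact training script is not published on this card, but in sentence-transformers this kind of objective is typically expressed by wrapping a triplet-style contrastive loss in MatryoshkaLoss. A minimal sketch under that assumption; the one-row dataset is purely illustrative:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Base model that AraGemma-Embedding-300m was fine-tuned from
model = SentenceTransformer("google/embeddinggemma-300m")

# Illustrative (anchor, positive, negative) triplet; the real run used ~1M Arabic triplets
train_dataset = Dataset.from_dict({
    "anchor":   ["ما هو الكوكب الأحمر؟"],          # "What is the red planet?"
    "positive": ["المريخ يسمى بالكوكب الأحمر."],   # "Mars is called the red planet."
    "negative": ["زحل يتميز بحلقاته الشهيرة."],    # "Saturn is known for its famous rings."
})

# Contrastive triplet loss, wrapped in MatryoshkaLoss so the leading
# 512/256/128 dimensions stay useful after truncation
base_loss = losses.MultipleNegativesRankingLoss(model)
loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```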
Model Information
Input
- Text string (Arabic or multilingual)
- Maximum context length: 2048 tokens
Output
- Dense vector representation of size 768
- Supports MRL truncation to 512, 256, or 128 dimensions with re-normalization (see the sketch below)
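Truncated vectors must be re-normalized before cosine comparison. A minimal sketch; the `truncate_and_normalize` helper is illustrative, not part of the model's API:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")
emb = model.encode("المريخ يسمى بالكوكب الأحمر.")  # numpy array of shape (768,)

def truncate_and_normalize(vec, dim):
    """Keep the leading `dim` MRL dimensions and rescale to unit length."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

emb_256 = truncate_and_normalize(emb, 256)  # MRL: 768 -> 256 dims
```

Recent sentence-transformers releases can also do this at load time via the `truncate_dim` constructor argument, combined with `normalize_embeddings=True` when encoding.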
Performance
Benchmark Results
Fine-tuning yields significant improvements over the base model, reflecting stronger Arabic semantic understanding.
Performance compared with other Arabic embedding models

Scores are Arabic STS benchmark results (higher is better):

| Model | Dim | # Params | STS17 | STS22-v2 | Average |
|---|---|---|---|---|---|
| Arabic-Triplet-Matryoshka-V2 | 768 | 135M | 85 | 64 | 75 |
| Arabert-all-nli-triplet-Matryoshka | 768 | 135M | 83 | 64 | 74 |
| **AraGemma-Embedding-300m** | 768 | 303M | 84 | 62 | 73 |
| GATE-AraBert-V1 | 767 | 135M | 83 | 63 | 73 |
| Marbert-all-nli-triplet-Matryoshka | 768 | 163M | 82 | 61 | 72 |
| Arabic-labse-Matryoshka | 768 | 471M | 82 | 61 | 72 |
| AraEuroBert-Small | 768 | 210M | 80 | 61 | 71 |
| E5-all-nli-triplet-Matryoshka | 384 | 278M | 80 | 60 | 70 |
| text-embedding-3-large | 3072 | - | 81 | 59 | 70 |
| Arabic-all-nli-triplet-Matryoshka | 768 | 135M | 82 | 54 | 68 |
| AraEuroBert-Mid | 1151 | 610M | 83 | 53 | 68 |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 135M | 79 | 55 | 67 |
| AraEuroBert-Large | 2304 | 2.1B | 79 | 55 | 67 |
| text-embedding-ada-002 | 1536 | - | 71 | 62 | 66 |
| text-embedding-3-small | 1536 | - | 72 | 57 | 65 |
Usage
This model is compatible with Sentence Transformers and Hugging Face Transformers.
```python
import torch
from sentence_transformers import SentenceTransformer

# Load the Arabic-optimized embedding model
model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Example: Arabic semantic similarity
query = "ما هو الكوكب الأحمر؟"  # "What is the red planet?"
documents = [
    "الزهرة تشبه الأرض في الحجم والقرب.",  # "Venus resembles Earth in size and proximity."
    "المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.",  # "Mars is called the red planet because of its distinctive color."
    "المشتري أكبر كواكب المجموعة الشمسية.",  # "Jupiter is the largest planet in the solar system."
    "زحل يتميز بحلقاته الشهيرة.",  # "Saturn is known for its famous rings."
]

# Encode the query and documents into dense vectors
query_embedding = model.encode(query, convert_to_tensor=True)
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# Cosine similarity between the query and each document;
# the Mars sentence should score highest
similarities = torch.cosine_similarity(query_embedding.unsqueeze(0), doc_embeddings)
print(similarities)
```
Applications
- Semantic Chunking for RAG (Retrieval-Augmented Generation)
- Semantic Search & Retrieval (Arabic focus; see the retrieval sketch after this list)
- Clustering and Classification of Arabic documents
- Cross-lingual retrieval (multilingual data supported)
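As a concrete instance of the search and RAG use cases above, here is a minimal top-k retrieval sketch built on sentence-transformers' `util.semantic_search`; the two-chunk corpus stands in for the output of a real chunking step:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/AraGemma-Embedding-300m")

# Illustrative Arabic corpus chunks (in a RAG pipeline, the output of semantic chunking)
corpus = [
    "المريخ يسمى بالكوكب الأحمر بسبب لونه المميز.",  # "Mars is called the red planet ..."
    "زحل يتميز بحلقاته الشهيرة.",                    # "Saturn is known for its famous rings."
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("ما هو الكوكب الأحمر؟", convert_to_tensor=True)

# Rank corpus chunks by cosine similarity and keep the best matches
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```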
Limitations
- Embedding activations do not support float16; use float32 or bfloat16 instead (see the loading sketch below).
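For example, a minimal sketch of loading the weights in bfloat16 by forwarding `torch_dtype` through `model_kwargs` (supported by recent sentence-transformers releases):

```python
import torch
from sentence_transformers import SentenceTransformer

# bfloat16 keeps memory low; float16 is not supported for the embedding activations
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/AraGemma-Embedding-300m",
    model_kwargs={"torch_dtype": torch.bfloat16},
)
emb = model.encode("نص تجريبي")  # "sample text"
```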
Citation
If you use this model in your work, please cite:
```bibtex
@misc{AraGemmaEmbedding2025,
  title  = {AraGemma-Embedding: Fine-tuned EmbeddingGemma for Arabic Semantic Understanding},
  author = {Omartificial-Intelligence-Space},
  year   = {2025},
  url    = {https://huggingface.co/Omartificial-Intelligence-Space/AraGemma-Embedding-300m}
}
```
Base model: google/embeddinggemma-300m