Overview
BTE-Base-Ar is a leading open-source model based on the Transformer architecture. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. With only 149 million parameters, it offers a strong balance between performance and efficiency, outperforming larger models while using significantly fewer resources.
Model Details
Model Description
- Model Type: Sentence Transformer
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Language: Arabic (ar)
- License: MIT
Key Features
- Lightweight & Efficient: 149M parameters vs competitors with 278-568M parameters
- Long Text Processing: Handles inputs of up to 8192 tokens using a sliding window technique (see the sketch after this list)
- High-Speed Inference: 3x faster than comparable models
- Arabic Language Optimization: Specifically fine-tuned for Arabic language nuances
- Resource Efficient: 75% less memory consumption than competitors
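As a minimal sketch of the long-text capability (the repeated filler sentence and variable names below are illustrative, not from the model card):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ALJIACHI/bte-base-ar")
print(model.max_seq_length)  # 8192

# Illustrative long input: a short Arabic sentence ("This is a long test paragraph.")
# repeated until the input far exceeds a typical 512-token limit.
long_text = " ".join(["هذه فقرة طويلة للتجربة."] * 400)
embedding = model.encode(long_text)
print(embedding.shape)  # (768,)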
Model Sources
- Documentation: [Sentence Transformers Documentation](https://sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
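A quick runtime check of this configuration (a sketch, assuming the model loads as in the Usage section below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ALJIACHI/bte-base-ar")
print(model[0])                     # Transformer wrapper around ModernBertModel
print(model[1].get_config_dict())   # pooling settings, incl. pooling_mode_mean_tokens=True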
Training Methodology
BTE-Base-Ar was trained on a diverse corpus of 741,159,981 tokens from:
- Authentic Arabic and English open-source datasets
- Manually crafted and processed text
- Purpose-generated synthetic data
This comprehensive training approach enables a deep understanding of both Arabic and English linguistic contexts.
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference:
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("ALJIACHI/bte-base-ar")
# Run inference
sentences = [
    # Gloss: a religious reflection on Qur'an 9:51 ("Say: nothing will befall us except
    # what Allah has decreed for us"), on the believer's tranquility before God.
    'وبيّن: بمقتضى عقيدتنا قُل لَّن يُصِيبَنَا إِلَّا مَا كَتَبَ اللَّهُ لَنَا ، أي أنّ الإنسان المؤمن دائماً يكون في حالة طمأنينة، وهذه العلاقة ما بين العبد وربّه هي علاقة عبدٍ مع سيّده، وكما ورد في بعض الأدعية خيرُك إلينا نازل وشرُّنا إليك صاعد ، نحن نتعامل مع الله سبحانه وتعالى وهو محضُ الخير ومحضُ الرحمة، وكلّ ما يصدر من الله تبارك وتعالى على العبد أن يكون في منتهى العبوديّة والتذلّل اليه جلّ شأنُه .',
    # Gloss: "The Ministry of Health announced a national vaccination campaign against
    # infectious diseases, aiming to protect children from infection."
    'أعلنت وزارة الصحة عن حملة تطعيم وطنية ضد الأمراض المعدية، تهدف إلى حماية الأطفال من العدوى.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [2, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [2, 2]
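Beyond pairwise similarity, the same embeddings support semantic search. The following is a minimal sketch using the library's util.semantic_search; the corpus and query strings are hypothetical examples, not from the model card.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ALJIACHI/bte-base-ar")

# Illustrative corpus and query.
corpus = [
    "أعلنت وزارة الصحة عن حملة تطعيم وطنية.",   # "The Ministry of Health announced a national vaccination campaign."
    "فاز المنتخب الوطني بالمباراة النهائية.",    # "The national team won the final match."
]
query = "حملة تلقيح ضد الأمراض"                  # "A vaccination campaign against diseases"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-1 nearest corpus entry by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(hits[0])  # e.g. [{'corpus_id': 0, 'score': ...}]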
Evaluation
Metrics
Semantic Similarity
- Evaluated with EmbeddingSimilarityEvaluator
| Metric | Value |
|---|---|
| pearson_cosine | 0.8598 |
| spearman_cosine | 0.8538 |
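To reproduce this kind of evaluation, the library's EmbeddingSimilarityEvaluator can be run against any STS-style sentence pairs with gold scores. The pairs and scores below are hypothetical placeholders, as the card does not name the evaluation set.

from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("ALJIACHI/bte-base-ar")

# Hypothetical STS-style pairs with gold similarity scores in [0, 1];
# the actual evaluation data is not specified in the model card.
sentences1 = ["الطقس اليوم مشمس.", "وصل القطار متأخراً."]
sentences2 = ["الجو صحوٌ هذا اليوم.", "فاز الفريق بالبطولة."]
gold_scores = [0.9, 0.05]

evaluator = EmbeddingSimilarityEvaluator(
    sentences1, sentences2, gold_scores,
    main_similarity=SimilarityFunction.COSINE,
    name="sts-dev",
)
results = evaluator(model)
print(results)  # includes keys like 'sts-dev_pearson_cosine' and 'sts-dev_spearman_cosine'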
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 4.0.1
- Transformers: 4.50.3
- PyTorch: 2.3.0+cu121
- Accelerate: 1.5.2
- Datasets: 3.5.0
- Tokenizers: 0.21.0
Citation
If you use BTE-Base-Ar in your research, please cite:
@software{BTE_Base_Ar_2025,
author = {Ali Aljiachi},
title = {BTE-Base-Ar: A Revolutionary Arabic Text Embeddings Model},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/Aljiachi/bte-base-ar}
}
@misc{modernbert,
title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
year={2024},
eprint={2412.13663},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.13663},
}