Arabic AI Benchmarks and Leaderboards

Over the past year, numerous benchmarks have been released to test various aspects of Arabic AI technologies, including LLM performance, multimodality/vision, embedding, retrieval, RAG generation, speech-to-text (STT), and OCR. This post serves as a comprehensive record of all benchmarks and leaderboards within the Arabic AI ecosystem. Our goal is to provide a centralized resource where the community can easily find the appropriate benchmark for an evaluation task or choose the top model for a specific task.
Leaderboards
Below is a list of leaderboards testing various aspects of Arabic AI models.
LLM Performance
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
Open Arabic LLM Leaderboard (OALL) v2 | General Knowledge, MMLU, Grammar, RAG Generation, Trust & Safety, Sentiment Analysis & Dialects | https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard | v1 legacy |
AraGen | Question Answering, Orthographic and Grammatical Analysis, Reasoning, Safety | https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard | Closed datasets |
Scale Seal | Coding, Creative, Educational Support, Idea Development, Writing & Communication and others | https://scale.com/leaderboard/arabic | Closed datasets, evaluated manually by human experts |
Embeddings
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
MTEB (Legacy) | General embedding (Sentence to Sentence) | https://huggingface.co/spaces/mteb/leaderboard_legacy | Click STS -> Other, then sort the STS17 (ar-ar) column in descending order |
The Arabic RAG Leaderboard | Retrieval and Re-ranking | https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard | A RAG Generation component is planned |
Vision / OCR
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
CAMEL-Bench | Vision understanding, OCR, chart understanding, video, medical imaging, and more | https://huggingface.co/spaces/ahmedheakl/CAMEL-Bench-leaderboard | |
Speech
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
Open Universal Arabic ASR Leaderboard | Multi-dialect Arabic ASR | https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard | |
Tokenizers
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
Arabic Tokenizers Leaderboard | Tokenizer efficiency via fertility score | https://huggingface.co/spaces/MohamedRashad/arabic-tokenizers-leaderboard | |
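
For readers unfamiliar with the term, fertility is commonly understood as the average number of tokens a tokenizer produces per word, where lower values mean more compact encoding. The snippet below is a minimal sketch of that idea, assuming words are separated by whitespace. The function names and the toy tokenizer are hypothetical illustrations, and the leaderboard's exact computation may differ.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_chunk_tokenizer(text: str, chunk_size: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into fixed-size character chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + chunk_size] for i in range(0, len(word), chunk_size))
    return tokens


# Two short Arabic sentences used purely as sample input.
sample = ["مرحبا بالعالم", "اللغة العربية جميلة"]
score = fertility(sample, toy_chunk_tokenizer)
print(f"fertility = {score:.2f} tokens per word")
```

To score a real tokenizer, pass its own tokenization function in place of `toy_chunk_tokenizer`; any callable that maps a string to a list of tokens will work with this sketch.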
Benchmarking datasets
Below is a non-comprehensive list of benchmarking datasets; it will grow over time.
Note: There are numerous research datasets available for benchmarking, but this list focuses on the most popular ones and on the datasets commonly used in research papers to evaluate Arabic models.
General purpose
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
Balsam Index | Many tasks | https://benchmarks.ksaa.gov.sa/b/balsam/tasks | Data quality issues |
RAG
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
SILMA RAGQA v1.0 | 17 bilingual datasets in Arabic and English, spanning various domains | https://huggingface.co/datasets/silma-ai/silma-rag-qa-benchmark-v1.0 | |
OCR
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
KITAB-Bench | Handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence | https://huggingface.co/collections/ahmedheakl/kitab-bench-677dd5d88d5db344d5595b78 | |
MMLU Arabic
Name | What does it evaluate? | Link | Comments |
---|---|---|---|
Global MMLU | MMLU | https://huggingface.co/datasets/CohereForAI/Global-MMLU/viewer/ar | |
Arabic MMLU | Multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions | https://huggingface.co/datasets/MBZUAI/ArabicMMLU?row=0 | |
Is a benchmark missing?
If you believe that a benchmark or leaderboard is missing from the list, please leave a comment below so we can consider adding it.