Arabic AI Benchmarks and Leaderboards

Community Article Published March 4, 2025

silma-ai

Over the past year, numerous benchmarks have been released to test various aspects of Arabic AI technologies, including LLM performance, multimodality/vision, embedding, retrieval, RAG generation, speech-to-text (STT), and OCR. This post is a record of the benchmarks and leaderboards in the Arabic AI ecosystem. Our goal is to give the community a central resource for finding the appropriate benchmark for an evaluation task or for choosing the top model for a specific task.

Leaderboards

Below is a list of leaderboards testing various aspects of Arabic AI models.

LLM Performance

Name	What does it evaluate?	Link	Comments
Open Arabic LLM Leaderboard (OALL) v2	General Knowledge, MMLU, Grammar, RAG Generation, Trust & Safety, Sentiment Analysis & Dialects	https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard	v1 legacy
Arabic-Leaderboards	IFEval, Question Answering, Orthographic and Grammatical Analysis, Reasoning, Safety	https://huggingface.co/spaces/inceptionai/Arabic-Leaderboards	Closed datasets (except IFEval)
Scale Seal	Coding, Creative, Educational Support, Idea Development, Writing & Communication, and others	https://scale.com/leaderboard/arabic	Closed datasets, evaluated manually by human experts
Arabic Broad Leaderboard (ABL)	Comprehensive evaluation of the Arabic language through testing proficiency in 22 skills and categories	https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard	Includes visualizations, analytical capabilities, model skill breakdowns, speed comparisons, and contamination detection mechanisms

Embeddings

Name	What does it evaluate?	Link	Comments
MTEB (Legacy)	General embedding (Sentence to Sentence)	https://huggingface.co/spaces/mteb/leaderboard_legacy	You will need to click on STS -> Other -> then sort STS17 (ar-ar) column descending
The Arabic RAG Leaderboard	Retrieval and Re-ranking	https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard	Adding RAG Generation component is planned

Vision / OCR

Name	What does it evaluate?	Link	Comments
CAMEL-Bench	Vision understanding, OCR, chart understanding, video, medical imaging, and more	https://huggingface.co/spaces/ahmedheakl/CAMEL-Bench-leaderboard

Speech

Name	What does it evaluate?	Link	Comments
Open Universal Arabic ASR Leaderboard	multi-dialect Arabic ASR	https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard
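
The table does not say which metric this leaderboard reports. ASR systems are most commonly scored with word error rate (WER), so the sketch below shows how WER is computed from a reference transcript and a model transcript. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not the leaderboard's own code, and real evaluations usually normalize the text (diacritics, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts (assumed, not taken from any benchmark).
reference = "اللغة العربية جميلة جدا"
hypothesis = "اللغة العربيه جميلة"
print(word_error_rate(reference, hypothesis))  # one substitution + one deletion over 4 words
```

Lower is better; a value above 1.0 is possible when the hypothesis contains many inserted words.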

Tokenizers

Name	What does it evaluate?	Link	Comments
Arabic Tokenizers Leaderboard	Tokenizer efficiency via fertility score	https://huggingface.co/spaces/MohamedRashad/arabic-tokenizers-leaderboard
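
Fertility is commonly defined as the average number of tokens a tokenizer produces per word, with lower values meaning the tokenizer represents text more compactly. The sketch below illustrates that common definition. The leaderboard's exact formula is not given here, so treat this as an assumption; the two toy tokenizers and the sample sentences are hypothetical stand-ins for a real tokenizer and corpus.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def word_tokenize(text: str) -> List[str]:
    # Hypothetical best case: exactly one token per word.
    return text.split()


def char_tokenize(text: str) -> List[str]:
    # Hypothetical worst case: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


sample = ["مرحبا بالعالم", "اللغة العربية جميلة"]
print("word-level fertility:", fertility(sample, word_tokenize))
print("char-level fertility:", fertility(sample, char_tokenize))
```

To score a real tokenizer, pass its tokenize function in place of the toy ones; any callable that maps a string to a list of tokens works.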

Benchmarking datasets

Below is a non-comprehensive list of benchmarking datasets; it will grow over time.

Note: Many research datasets are available for benchmarking, but this list focuses on the most popular ones and on those commonly used in research papers to evaluate Arabic models.

General purpose

Name	What does it evaluate?	Link	Comments
Balsam Index	many tasks	https://benchmarks.ksaa.gov.sa/b/balsam/tasks	Data quality issues

RAG

Name	What does it evaluate?	Link	Comments
SILMA RAGQA v1.0	17 bilingual datasets in Arabic and English, spanning various domains	https://huggingface.co/datasets/silma-ai/silma-rag-qa-benchmark-v1.0

OCR

Name	What does it evaluate?	Link	Comments
KITAB-Bench	handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence	https://huggingface.co/collections/ahmedheakl/kitab-bench-677dd5d88d5db344d5595b78

MMLU Arabic

Name	What does it evaluate?	Link	Comments
Global MMLU	MMLU	https://huggingface.co/datasets/CohereForAI/Global-MMLU/viewer/ar
Arabic MMLU	Multi-task language understanding benchmark for Arabic, sourced from school exams across diverse educational levels in countries spanning North Africa, the Levant, and the Gulf	https://huggingface.co/datasets/MBZUAI/ArabicMMLU?row=0

Benchmark is missing?

If you believe that a benchmark or leaderboard is not included in the list, please leave a comment below so we can consider adding it.

Community

anwarvic

May 21

Thanks, everyone, for this comprehensive curation of Arabic leaderboards. Here are a few more that I came across and wanted to share with you:

Al-Abdulkarim

21 days ago

I’d like to suggest a missing benchmark:
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
It’s focused on evaluating Arabic models on multimodal and reasoning capabilities. Please consider adding it to the list!
