ABBL: NextGen LLM Benchmark & Leaderboard for evaluating Arabic models

Community Article Published May 18, 2025

The Arabic Broad Benchmark and Leaderboard (ABBL) is an advanced LLM leaderboard and benchmark from SILMA.AI offering innovative visualizations, analytical capabilities, model skill breakdowns, speed comparisons, and contamination detection mechanisms.

ABBL provides the community with an unprecedented ability to study the capabilities of Arabic models and choose the right model for the right task with confidence.

TL;DR

Human-validated, compact dataset of only 470 questions covering 22 Arabic language skills, sampled from 64 diverse datasets
A new evaluation approach combines customized manual rules and LLM as Judge variations, tailored to the specific question types
Unique features for analyzing and visually comparing models
Novel contamination detection mechanisms
Sub-leaderboards for models within specific size ranges to ensure more accurate and fair comparisons
Incorporating model speed metrics alongside model performance metrics

The Leaderboard: https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard

The Benchmark: https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark

Why another Benchmark?

What can't be measured can't be improved

At SILMA.AI we aim to build state-of-the-art language models for Arabic language by building over open-source models instead of reinventing the wheel.

To do so, we needed to evaluate the base models across all skills as well as finding the right one for the right task. However, we've recently reached a point where all current datasets and benchmarks did not meet our standards for accurately evaluating models with a sufficient level of confidence that we can rely on to make business decisions.

Issues we aimed to address in existing benchmarks

Existing Arabic benchmarks focus narrowly on a limited set of skills (max 8) such as reasoning and question answering while not fully covering the unique features of the Arabic language, such as its rich variety of dialects, complex grammar, diacritization, etc
Public benchmarks can easily be contaminated, thus unreliable
Private benchmarks (closed datasets) are not accessible to the community and lack the same level of trust compared to public benchmarks
Current benchmarks either focuses on MCQ or Generation questions and not both
Some benchmarks still suffer from data quality issues which reduces confidence in the results
The available benchmarks are resource and time-intensive to run, relying on heavy evaluation frameworks and development libraries, some of which do not support new models quickly enough
Finally, we needed to have the ability to compare both closed and open models and integrate the benchmark in our internal pipelines

The Dataset

We introduce a compact yet comprehensive dataset comprising 470 high-quality, human-validated questions. Sampled from 64 existing Arabic benchmarking datasets - this includes datasets from pioneering Arabic benchmarks like OALL and Arabic Leaderboards.

Our goal was to provide a broad assessment of a model's overall Arabic language performance, rather than focusing intensely on a single task.

The data assesses 22 skills with a focus on the specificities of the Arabic language, from writing in dialects and diacritics to reasoning and MMLU. To the best of our knowledge, this is the first dataset of its kind for the Arabic language.

Category Statistics

category	counts	percentage
MMLU	121	25.74%
General Knowledge	63	13.4%
Reasoning & Math	43	9.15%
RAG QA	41	8.72%
Translation (incl Dialects)	36	7.66%
Trust & Safety	30	6.38%
Writing (incl Dialects)	22	4.68%
Reading Comprehension	17	3.62%
Arabic Language & Grammar	17	3.62%
Diacritization	12	2.55%
Dialect Detection	11	2.34%
Sentiment Analysis	9	1.91%
Summarization	8	1.7%
Instruction Following	7	1.49%
Transliteration	6	1.28%
Paraphrasing	6	1.28%
Entity Extraction	5	1.06%
Long Context	4	0.85%
Coding	3	0.64%
Hallucination	3	0.64%
Function Calling	3	0.64%
Structuring	3	0.64%

Subcategories

Questions Format

format	counts	percentage
MCQ	229	48.72%
Generation	228	48.51%
Fill-in-the-blank	8	1.7%
Short Answer	5	1.06%

Dataset generation process

Hundreds of questions were sampled from 64 diverse Arabic benchmarking datasets
An initial automated quality check using GPT-4.1 and Gemini 2.5 eliminated questions unanswerable by both models (over 50% reduction)
Remaining questions underwent human validation, including inspection, answering, and cross-referencing with LLM responses (further 20% reduction)
Rewording of questions and updating of reference answers occurred during human validation
This multi-stage filtering yielded a final set of 470 high-quality questions
The final set was subjected to additional testing and refinement during the benchmarking stage

The dataset serves as the foundation for our Arabic Broad Benchmark (ABB).

The Arabic Broad Benchmark (ABB)

ABB is an open-source benchmarking system that utilizes our new comprehensive dataset to evaluate Arabic LLMs on Hugging Face as well as APIs.

Methodology

The benchmarking script employs a sophisticated mix of over 20 manual evaluation rules and customized "LLM-as-judge" variations, tailored specifically to each skill and question type being assessed.

Example: to evaluate the accuracy of Arabic diacritization, the MANUAL_DIACRITIZATION rule is employed. This method assesses the difference between the reference text and the generated text at the character level. This approach is used instead of relying on an LLM as a judge prompt, as LLMs are not dependable for evaluating such fine-grained distinctions.

Below is the list of custom scoring rules used in during benchmarking:

Scoring Rule	Count	Description
AUTOMATED_LLM_AS_A_JUDGE_MCQ	218	Automated scoring using an LLM as a judge for Multiple Choice Questions. (custom prompt)
AUTOMATED_LLM_AS_A_JUDGE_GENERATION	173	Automated scoring using an LLM as a judge for text generation tasks. (custom prompt)
MANUAL_ROUGE_SCORE	65	Manual calculation of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score.
MANUAL_METEOR_SCORE	34	Manual calculation of METEOR (Metric for Evaluation of Translation with Explicit ORdering) score.
AUTOMATED_LLM_AS_A_JUDGE_WRITING_DIALECT	30	Automated scoring using an LLM judge for dialect accuracy in writing. (custom prompt)
AUTOMATED_LLM_AS_A_JUDGE_REASONING	21	Automated scoring using an LLM judge for reasoning capabilities. (custom prompt)
MANUAL_WORDS_INTERSECTION	19	Manual check for the intersection of words between generated and reference text.
MANUAL_DIACRITIZATION	12	Manual scoring of diacritization accuracy using Levenshtein distance + other conditions
MANUAL_DIALECT_MATCHING	11	Manual scoring for how well a generated dialect matches a target dialect.
MANUAL_RELATIVE_MIN_DISTANCE	6	Manual calculation of the relative change in distance (Levenshtein) between base to reference text and generated to reference text
MANUAL_CLOSE_TO_REFERENCE_LENGTH	6	Manual check if the generated text length is close to the reference text length.
MANUAL_MIN_DISTANCE	6	Manual calculation of minimum edit distance (Levenshtein).
MANUAL_IS_VALID_JSON	5	Manual check if the output is valid JSON format.
AUTOMATED_LLM_AS_A_JUDGE_GRAMMAR_IRAB	3	Automated LLM as a judge for grammar 'Irab'. (custom prompt)
MANUAL_IFEVAL_1	3	Manual evaluation based on a specific 'IFEVAL' criterion (version 1).
MANUAL_STRUCTURING_1	3	Manual evaluation of output structuring for each relevant question.
MANUAL_IFEVAL_2	2	Manual evaluation based on a specific 'IFEVAL' criterion (version 2).
MANUAL_MRCR_FIRST_LINE_MATCH	2	Manual check if the first line in generated matches reference by checking the Levenshtein distance of the first 100 characters only

Efficiency

Using ABB, you can evaluate models (up to 15B parameters) quickly and efficiently, typically completing the process in under an hour on a single GPU.

Skill breakdown and Speed

Upon completion, the system provides a detailed skill-level breakdown, allowing you to clearly understand the strengths and weaknesses of each evaluated model. Additionally, you also get the speed of the model (words per second) as well as all the model responses in a beautiful HTML file for further analysis.

How does scoring work?

Each question is scored from 0 to 10 using manual rules, LLM as Judge or both. The final benchmark score is calculated by taking the average of all individual question scores

Open and accessible to everyone

Evaluating a model using the ABB benchmark is a simple three-step process. Detailed instructions are provided in the following link:

https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark#how-to-use-abb-to-benchmark-a-model

Other features

This benchmark allows testing of custom APIs alongside HuggingFace models. It also supports batching for quicker evaluations. Furthermore, the benchmark can now handle <thinking> models by extracting and evaluating only the text following these tags.

The Arabic Broad Leaderboard (ABL)

The ABL sets a new standard for evaluating Arabic models by including innovative and distinct features that are not commonly found on other leaderboards.

Key Innovations

Contamination detection: a novel contamination prevention method utilizing proprietary code to determine the probability of a model encountering/using the test data during training. The contamination score is shown alongside the model's output with a red sign.

To maintain the integrity of the leaderboard, rigorous measures are implemented to avoid repeated model evaluations. Also organizations and accounts are limited to one submission each month.

To prevent optimization for a lower contamination score, we have concealed details such as the algorithm, the threshold, and any scores falling below it.

Furthermore, any model exhibiting evidence of contamination is promptly removed and subjected to further investigation. As a final measure, a banning mechanism is in place to prevent abuse.
Speed: comparing models in terms of speed and performance.

Model speed, measured in words per second, is determined by dividing the total number of words generated during testing by the testing time in seconds. To ensure a fair comparison among Hugging Face models, we utilize the same GPU (A100) and a batch size of 1 for all models. Models exceeding 15 billion parameters are distributed across multiple GPUs.

Comparisons should be limited to models within the same size category. API or closed models can only be compared to other API models as they are not hosted on our infrastructure.
Size sub-leaderboards: adding leaderboard sections to allow comparison of models based on size. This will address questions like: What is the best Arabic language model with fewer than 10 billion parameters?

Model size categories are defined as follows:
- Nano: Models with less than 3.5 billion parameters.
- Small: Models ranging from 3.5 billion to 10 billion parameters.
- Medium: Models ranging from 10 billion to 35 billion parameters.
- Large: Models exceeding 35 billion parameters.
Skill sub-leaderboards: Integrate leaderboard sections to facilitate model comparison based on capabilities. Address questions like identifying the top Arabic model for long context processing.
Visual comparison: comparing two or more models by skill using a radar chart.
Deep dive: this report details a specific model, outlining its strengths and weaknesses. For increased transparency, all of the model's outputs are also provided.

Diversity of model sources

To provide a comprehensive view of high-performing Arabic language models, we benchmarked both:

APIs: Closed-source models accessed and tested through their respective APIs.
Hugging Face: Open-source models downloaded from Hugging Face and evaluated using the transformers library.

Outro

ABL offers the community a unique opportunity to evaluate Arabic language models and choose the optimal one for particular applications. Moreover, the novel characteristics of ABL are intended to encourage the creation of more sophisticated, well-managed, and visually informative leaderboards and benchmarks.

We are excited to see how both technical and business users will utilize the benchmark and leaderboard to make more informed decisions.

Ready to learn more?

Arabic Broad Leaderboard (ABL):

https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard
Arabic Broad Benchmark (ABB) Dataset & Script:

https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote