ABBL: NextGen LLM Benchmark & Leaderboard for evaluating Arabic models

The Arabic Broad Benchmark and Leaderboard (ABBL) is an advanced LLM leaderboard and benchmark from SILMA.AI offering innovative visualizations, analytical capabilities, model skill breakdowns, speed comparisons, and contamination detection mechanisms.
ABBL provides the community with an unprecedented ability to study the capabilities of Arabic models and choose the right model for the right task with confidence.
TL;DR
- Human-validated, compact dataset of only 470 questions covering 22 Arabic language skills, sampled from 64 diverse datasets
- A new evaluation approach combines customized manual rules and LLM as Judge variations, tailored to the specific question types
- Unique features for analyzing and visually comparing models
- Novel contamination detection mechanisms
- Sub-leaderboards for models within specific size ranges to ensure more accurate and fair comparisons
- Incorporating model speed metrics alongside model performance metrics
The Leaderboard: https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard
The Benchmark: https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark
Why another Benchmark?
What can't be measured can't be improved
At SILMA.AI we aim to build state-of-the-art language models for Arabic language by building over open-source models instead of reinventing the wheel.
To do so, we needed to evaluate the base models across all skills as well as finding the right one for the right task. However, we've recently reached a point where all current datasets and benchmarks did not meet our standards for accurately evaluating models with a sufficient level of confidence that we can rely on to make business decisions.
Issues we aimed to address in existing benchmarks
- Existing Arabic benchmarks focus narrowly on a limited set of skills (max 8) such as reasoning and question answering while not fully covering the unique features of the Arabic language, such as its rich variety of dialects, complex grammar, diacritization, etc
- Public benchmarks can easily be contaminated, thus unreliable
- Private benchmarks (closed datasets) are not accessible to the community and lack the same level of trust compared to public benchmarks
- Current benchmarks either focuses on MCQ or Generation questions and not both
- Some benchmarks still suffer from data quality issues which reduces confidence in the results
- The available benchmarks are resource and time-intensive to run, relying on heavy evaluation frameworks and development libraries, some of which do not support new models quickly enough
- Finally, we needed to have the ability to compare both closed and open models and integrate the benchmark in our internal pipelines
The Dataset
We introduce a compact yet comprehensive dataset comprising 470 high-quality, human-validated questions. Sampled from 64 existing Arabic benchmarking datasets - this includes datasets from pioneering Arabic benchmarks like OALL and Arabic Leaderboards.
Our goal was to provide a broad assessment of a model's overall Arabic language performance, rather than focusing intensely on a single task.
The data assesses 22 skills with a focus on the specificities of the Arabic language, from writing in dialects and diacritics to reasoning and MMLU. To the best of our knowledge, this is the first dataset of its kind for the Arabic language.
Category Statistics
category | counts | percentage |
---|---|---|
MMLU | 121 | 25.74% |
General Knowledge | 63 | 13.4% |
Reasoning & Math | 43 | 9.15% |
RAG QA | 41 | 8.72% |
Translation (incl Dialects) | 36 | 7.66% |
Trust & Safety | 30 | 6.38% |
Writing (incl Dialects) | 22 | 4.68% |
Reading Comprehension | 17 | 3.62% |
Arabic Language & Grammar | 17 | 3.62% |
Diacritization | 12 | 2.55% |
Dialect Detection | 11 | 2.34% |
Sentiment Analysis | 9 | 1.91% |
Summarization | 8 | 1.7% |
Instruction Following | 7 | 1.49% |
Transliteration | 6 | 1.28% |
Paraphrasing | 6 | 1.28% |
Entity Extraction | 5 | 1.06% |
Long Context | 4 | 0.85% |
Coding | 3 | 0.64% |
Hallucination | 3 | 0.64% |
Function Calling | 3 | 0.64% |
Structuring | 3 | 0.64% |
Subcategories
Questions Format
format | counts | percentage |
---|---|---|
MCQ | 229 | 48.72% |
Generation | 228 | 48.51% |
Fill-in-the-blank | 8 | 1.7% |
Short Answer | 5 | 1.06% |
Dataset generation process
- Hundreds of questions were sampled from 64 diverse Arabic benchmarking datasets
- An initial automated quality check using GPT-4.1 and Gemini 2.5 eliminated questions unanswerable by both models (over 50% reduction)
- Remaining questions underwent human validation, including inspection, answering, and cross-referencing with LLM responses (further 20% reduction)
- Rewording of questions and updating of reference answers occurred during human validation
- This multi-stage filtering yielded a final set of 470 high-quality questions
- The final set was subjected to additional testing and refinement during the benchmarking stage
The dataset serves as the foundation for our Arabic Broad Benchmark (ABB).
The Arabic Broad Benchmark (ABB)
ABB is an open-source benchmarking system that utilizes our new comprehensive dataset to evaluate Arabic LLMs on Hugging Face as well as APIs.
Methodology
The benchmarking script employs a sophisticated mix of over 20 manual evaluation rules and customized "LLM-as-judge" variations, tailored specifically to each skill and question type being assessed.
Example: to evaluate the accuracy of Arabic diacritization, the MANUAL_DIACRITIZATION rule is employed. This method assesses the difference between the reference text and the generated text at the character level. This approach is used instead of relying on an LLM as a judge prompt, as LLMs are not dependable for evaluating such fine-grained distinctions.
Below is the list of custom scoring rules used in during benchmarking:
Scoring Rule | Count | Description |
---|---|---|
AUTOMATED_LLM_AS_A_JUDGE_MCQ | 218 | Automated scoring using an LLM as a judge for Multiple Choice Questions. (custom prompt) |
AUTOMATED_LLM_AS_A_JUDGE_GENERATION | 173 | Automated scoring using an LLM as a judge for text generation tasks. (custom prompt) |
MANUAL_ROUGE_SCORE | 65 | Manual calculation of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score. |
MANUAL_METEOR_SCORE | 34 | Manual calculation of METEOR (Metric for Evaluation of Translation with Explicit ORdering) score. |
AUTOMATED_LLM_AS_A_JUDGE_WRITING_DIALECT | 30 | Automated scoring using an LLM judge for dialect accuracy in writing. (custom prompt) |
AUTOMATED_LLM_AS_A_JUDGE_REASONING | 21 | Automated scoring using an LLM judge for reasoning capabilities. (custom prompt) |
MANUAL_WORDS_INTERSECTION | 19 | Manual check for the intersection of words between generated and reference text. |
MANUAL_DIACRITIZATION | 12 | Manual scoring of diacritization accuracy using Levenshtein distance + other conditions |
MANUAL_DIALECT_MATCHING | 11 | Manual scoring for how well a generated dialect matches a target dialect. |
MANUAL_RELATIVE_MIN_DISTANCE | 6 | Manual calculation of the relative change in distance (Levenshtein) between base to reference text and generated to reference text |
MANUAL_CLOSE_TO_REFERENCE_LENGTH | 6 | Manual check if the generated text length is close to the reference text length. |
MANUAL_MIN_DISTANCE | 6 | Manual calculation of minimum edit distance (Levenshtein). |
MANUAL_IS_VALID_JSON | 5 | Manual check if the output is valid JSON format. |
AUTOMATED_LLM_AS_A_JUDGE_GRAMMAR_IRAB | 3 | Automated LLM as a judge for grammar 'Irab'. (custom prompt) |
MANUAL_IFEVAL_1 | 3 | Manual evaluation based on a specific 'IFEVAL' criterion (version 1). |
MANUAL_STRUCTURING_1 | 3 | Manual evaluation of output structuring for each relevant question. |
MANUAL_IFEVAL_2 | 2 | Manual evaluation based on a specific 'IFEVAL' criterion (version 2). |
MANUAL_MRCR_FIRST_LINE_MATCH | 2 | Manual check if the first line in generated matches reference by checking the Levenshtein distance of the first 100 characters only |
Efficiency
Using ABB, you can evaluate models (up to 15B parameters) quickly and efficiently, typically completing the process in under an hour on a single GPU.
Skill breakdown and Speed
Upon completion, the system provides a detailed skill-level breakdown, allowing you to clearly understand the strengths and weaknesses of each evaluated model. Additionally, you also get the speed of the model (words per second) as well as all the model responses in a beautiful HTML file for further analysis.
How does scoring work?
Each question is scored from 0 to 10 using manual rules, LLM as Judge or both. The final benchmark score is calculated by taking the average of all individual question scores
Open and accessible to everyone
Evaluating a model using the ABB benchmark is a simple three-step process. Detailed instructions are provided in the following link:
Other features
This benchmark allows testing of custom APIs alongside HuggingFace models. It also supports batching for quicker evaluations. Furthermore, the benchmark can now handle <thinking>
models by extracting and evaluating only the text following these tags.
The Arabic Broad Leaderboard (ABL)
The ABL sets a new standard for evaluating Arabic models by including innovative and distinct features that are not commonly found on other leaderboards.
Key Innovations
Contamination detection: a novel contamination prevention method utilizing proprietary code to determine the probability of a model encountering/using the test data during training. The contamination score is shown alongside the model's output with a red sign.
To maintain the integrity of the leaderboard, rigorous measures are implemented to avoid repeated model evaluations. Also organizations and accounts are limited to one submission each month.
To prevent optimization for a lower contamination score, we have concealed details such as the algorithm, the threshold, and any scores falling below it.
Furthermore, any model exhibiting evidence of contamination is promptly removed and subjected to further investigation. As a final measure, a banning mechanism is in place to prevent abuse.
Speed: comparing models in terms of speed and performance.
Model speed, measured in words per second, is determined by dividing the total number of words generated during testing by the testing time in seconds. To ensure a fair comparison among Hugging Face models, we utilize the same GPU (A100) and a batch size of 1 for all models. Models exceeding 15 billion parameters are distributed across multiple GPUs.
Comparisons should be limited to models within the same size category. API or closed models can only be compared to other API models as they are not hosted on our infrastructure.
Size sub-leaderboards: adding leaderboard sections to allow comparison of models based on size. This will address questions like: What is the best Arabic language model with fewer than 10 billion parameters?
Model size categories are defined as follows:
- Nano: Models with less than 3.5 billion parameters.
- Small: Models ranging from 3.5 billion to 10 billion parameters.
- Medium: Models ranging from 10 billion to 35 billion parameters.
- Large: Models exceeding 35 billion parameters.
Skill sub-leaderboards: Integrate leaderboard sections to facilitate model comparison based on capabilities. Address questions like identifying the top Arabic model for long context processing.
Visual comparison: comparing two or more models by skill using a radar chart.
Deep dive: this report details a specific model, outlining its strengths and weaknesses. For increased transparency, all of the model's outputs are also provided.
Diversity of model sources
To provide a comprehensive view of high-performing Arabic language models, we benchmarked both:
- APIs: Closed-source models accessed and tested through their respective APIs.
- Hugging Face: Open-source models downloaded from Hugging Face and evaluated using the transformers library.
Outro
ABL offers the community a unique opportunity to evaluate Arabic language models and choose the optimal one for particular applications. Moreover, the novel characteristics of ABL are intended to encourage the creation of more sophisticated, well-managed, and visually informative leaderboards and benchmarks.
We are excited to see how both technical and business users will utilize the benchmark and leaderboard to make more informed decisions.
Ready to learn more?
Arabic Broad Leaderboard (ABL):
https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard
Arabic Broad Benchmark (ABB) Dataset & Script:
https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark