Arabic AI Benchmarks and Leaderboards

Community Article Published March 4, 2025


Over the past year, numerous benchmarks have been released to test various aspects of Arabic AI technologies, including LLM performance, multimodality/vision, embedding, retrieval, RAG generation, speech-to-text (STT), and OCR. This post aims to be a comprehensive record of the benchmarks and leaderboards within the Arabic AI ecosystem. Our goal is to provide a centralized resource where the community can easily find the appropriate benchmark for an evaluation task or choose the top model for a specific task.

Leaderboards

Below is a list of leaderboards that test various aspects of Arabic AI models.

LLM Performance

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Open Arabic LLM Leaderboard (OALL) v2 | General Knowledge, MMLU, Grammar, RAG Generation, Trust & Safety, Sentiment Analysis & Dialects | https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard | v1 legacy |
| AraGen | Question Answering, Orthographic and Grammatical Analysis, Reasoning, Safety | https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard | Closed datasets |
| Scale SEAL | Coding, Creative, Educational Support, Idea Development, Writing & Communication, and others | https://scale.com/leaderboard/arabic | Closed datasets, evaluated manually by human experts |

Embeddings

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| MTEB (Legacy) | General embedding (Sentence to Sentence) | https://huggingface.co/spaces/mteb/leaderboard_legacy | Click STS -> Other, then sort the STS17 (ar-ar) column in descending order |
| The Arabic RAG Leaderboard | Retrieval and Re-ranking | https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard | A RAG generation component is planned |

Vision / OCR

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| CAMEL-Bench | Vision understanding, OCR, chart understanding, video, medical imaging, and more | https://huggingface.co/spaces/ahmedheakl/CAMEL-Bench-leaderboard | |

Speech

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Open Universal Arabic ASR Leaderboard | Multi-dialect Arabic ASR | https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard | |

Tokenizers

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Arabic Tokenizers Leaderboard | Tokenizer efficiency via fertility score | https://huggingface.co/spaces/MohamedRashad/arabic-tokenizers-leaderboard | |
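
For readers unfamiliar with the fertility score, here is a minimal sketch of one common definition: the average number of tokens a tokenizer produces per whitespace-delimited word, where lower values mean a more compact encoding. The exact formula and corpus the leaderboard uses are not stated here, so treat this as an assumption. The `fertility` function and both toy tokenizers are hypothetical helpers written for illustration, not part of any library.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


# Hypothetical toy tokenizers, for illustration only.
def word_tokenizer(text: str) -> List[str]:
    # One token per word: the best possible fertility under this definition.
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    # One token per non-space character: a deliberately inefficient baseline.
    return [ch for ch in text if not ch.isspace()]


sample = ["مرحبا بالعالم", "اللغة العربية جميلة"]
print("word-level fertility:", fertility(word_tokenizer, sample))
print("char-level fertility:", fertility(char_tokenizer, sample))
```

To score a real tokenizer, pass any callable that maps a string to a list of tokens in place of the toy functions. A tokenizer that splits Arabic words into many small pieces will score closer to the character-level baseline than to the word-level one.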

Benchmarking datasets

Below is a non-comprehensive list of benchmarking datasets; it will grow over time.

Note: Numerous research datasets are available for benchmarking, but this list focuses on the most popular ones and on the datasets commonly used in research papers to evaluate Arabic models.

General purpose

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Balsam Index | Many tasks | https://benchmarks.ksaa.gov.sa/b/balsam/tasks | Data quality issues |

RAG

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| SILMA RAGQA v1.0 | 17 bilingual datasets in Arabic and English, spanning various domains | https://huggingface.co/datasets/silma-ai/silma-rag-qa-benchmark-v1.0 | |

OCR

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| KITAB-Bench | Handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence | https://huggingface.co/collections/ahmedheakl/kitab-bench-677dd5d88d5db344d5595b78 | |

MMLU Arabic

| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Global MMLU | MMLU | https://huggingface.co/datasets/CohereForAI/Global-MMLU/viewer/ar | |
| Arabic MMLU | Multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions | https://huggingface.co/datasets/MBZUAI/ArabicMMLU?row=0 | |

Is a benchmark missing?

If you believe that a benchmark or leaderboard is not included in the list, please leave a comment below so we can consider adding it.
