Arabic AI Benchmarks and Leaderboards

Community Article Published March 4, 2025

Over the past year, numerous benchmarks have been released to test various aspects of Arabic AI technologies, including LLM performance, multimodality/vision, embedding, retrieval, RAG generation, speech-to-text (STT), and OCR. This post serves as a comprehensive record of all benchmarks and leaderboards within the Arabic AI ecosystem. Our goal is to provide a centralized resource where the community can easily find the appropriate benchmark for an evaluation task or choose the top model for a specific task.

Leaderboards

Below is a list of leaderboards that test various aspects of Arabic AI models.

LLM Performance

Name	What does it evaluate?	Link	Comments
Open Arabic LLM Leaderboard (OALL) v2	General Knowledge, MMLU, Grammar, RAG Generation, Trust & Safety, Sentiment Analysis & Dialects	https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard	v1 legacy
AraGen	Question Answering, Orthographic and Grammatical Analysis, Reasoning, Safety	https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard	Closed datasets
Scale SEAL	Coding, Creative, Educational Support, Idea Development, Writing & Communication, and others	https://scale.com/leaderboard/arabic	Closed datasets, evaluated manually by human experts

Embeddings

Name	What does it evaluate?	Link	Comments
MTEB (Legacy)	General embedding (Sentence to Sentence)	https://huggingface.co/spaces/mteb/leaderboard_legacy	You will need to click on STS -> Other -> then sort STS17 (ar-ar) column descending
The Arabic RAG Leaderboard	Retrieval and Re-ranking	https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard	Adding RAG Generation component is planned

Vision / OCR

Name	What does it evaluate?	Link	Comments
CAMEL-Bench	Vision understanding, OCR, chart understanding, video, medical imaging, and more	https://huggingface.co/spaces/ahmedheakl/CAMEL-Bench-leaderboard

Speech

Name	What does it evaluate?	Link	Comments
Open Universal Arabic ASR Leaderboard	multi-dialect Arabic ASR	https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard

Tokenizers

Name	What does it evaluate?	Link	Comments
Arabic Tokenizers Leaderboard	Tokenizer efficiency via fertility score	https://huggingface.co/spaces/MohamedRashad/arabic-tokenizers-leaderboard
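
For readers unfamiliar with the fertility score mentioned above: it is usually defined as the average number of tokens a tokenizer produces per word, so lower values mean a more efficient tokenizer. The sketch below illustrates that common definition only; the leaderboard's exact formula and word-splitting rule are not stated in this post, and the function names and the two toy tokenizers are hypothetical stand-ins for a real tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Assumption: words are split on whitespace. A real leaderboard may
    use a different word definition or a different corpus.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("cannot compute fertility on a corpus with no words")
    return total_tokens / total_words


def whitespace_tokenize(text: str) -> List[str]:
    # Hypothetical best case: exactly one token per word.
    return text.split()


def char_tokenize(text: str) -> List[str]:
    # Hypothetical worst case: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


sample = ["اللغة العربية جميلة", "مرحبا بالعالم"]

word_level = fertility(sample, whitespace_tokenize)
char_level = fertility(sample, char_tokenize)
print(f"whitespace tokenizer fertility: {word_level:.2f}")
print(f"character tokenizer fertility:  {char_level:.2f}")
```

To score a real tokenizer, pass any function that maps a string to a list of tokens in place of the toy tokenizers.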

Benchmarking datasets

Below is a non-comprehensive list of benchmarking datasets; it will grow over time.

Note: There are numerous research datasets available for benchmarking, but this list focuses on the most popular ones and on the datasets commonly used in research papers to evaluate Arabic models.

General purpose

Name	What does it evaluate?	Link	Comments
Balsam Index	many tasks	https://benchmarks.ksaa.gov.sa/b/balsam/tasks	Data quality issues

RAG

Name	What does it evaluate?	Link	Comments
SILMA RAGQA v1.0	17 bilingual datasets in Arabic and English, spanning various domains	https://huggingface.co/datasets/silma-ai/silma-rag-qa-benchmark-v1.0

OCR

Name	What does it evaluate?	Link	Comments
KITAB-Bench	Handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence	https://huggingface.co/collections/ahmedheakl/kitab-bench-677dd5d88d5db344d5595b78

MMLU Arabic

Name	What does it evaluate?	Link	Comments
Global MMLU	MMLU	https://huggingface.co/datasets/CohereForAI/Global-MMLU/viewer/ar
Arabic MMLU	Multi-task language understanding for Arabic, sourced from school exams across diverse educational levels in countries spanning North Africa, the Levant, and the Gulf	https://huggingface.co/datasets/MBZUAI/ArabicMMLU?row=0

Is a benchmark missing?

If you believe that a benchmark or leaderboard is not included in the list, please leave a comment below so we can consider adding it.
