SILMA's AlGhafa score is not correct

#16
by karimouda - opened

Congratulations on the new version; it is definitely a step in the right direction.

I have noticed that the AlGhafa score for silma-ai/SILMA-9B-Instruct-v1.0 is extremely low (33.99), while in the previous version it was 71.85.

Today, I ran Lighteval on the AlGhafa tasks (examples/tasks/OALL_v2_tasks.txt) to double-check and got the results below. Could you please investigate this issue?
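For reference, the evaluation was run with a command along these lines. This is a sketch only: the exact flag names and the custom-tasks path (`community_tasks/arabic_evals.py`) are assumptions and may differ between lighteval versions.

```shell
# Hypothetical invocation of lighteval's accelerate backend on the OALL v2
# task list; adjust flags to match your installed lighteval version.
lighteval accelerate \
    --model_args "pretrained=silma-ai/SILMA-9B-Instruct-v1.0" \
    --tasks examples/tasks/OALL_v2_tasks.txt \
    --custom_tasks community_tasks/arabic_evals.py \
    --output_dir ./results
```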

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| all | | acc_norm | 0.7201 | ± 0.0172 |
| community:alghafa:_average:0 | | acc_norm | 0.7201 | ± 0.0172 |
| community:alghafa:mcq_exams_test_ar:0 | 0 | acc_norm | 0.5135 | ± 0.0212 |
| community:alghafa:meta_ar_dialects:0 | 0 | acc_norm | 0.7068 | ± 0.0062 |
| community:alghafa:meta_ar_msa:0 | 0 | acc_norm | 0.8235 | ± 0.0128 |
| community:alghafa:multiple_choice_facts_truefalse_balanced_task:0 | 0 | acc_norm | 0.8933 | ± 0.0359 |
| community:alghafa:multiple_choice_grounded_statement_soqal_task:0 | 0 | acc_norm | 0.8867 | ± 0.0260 |
| community:alghafa:multiple_choice_grounded_statement_xglue_mlqa_task:0 | 0 | acc_norm | 0.8133 | ± 0.0319 |
| community:alghafa:multiple_choice_rating_sentiment_no_neutral_task:0 | 0 | acc_norm | 0.9071 | ± 0.0032 |
| community:alghafa:multiple_choice_rating_sentiment_task:0 | 0 | acc_norm | 0.6027 | ± 0.0063 |
| community:alghafa:multiple_choice_sentiment_task:0 | 0 | acc_norm | 0.3343 | ± 0.0114 |
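As a sanity check on the run itself, the reported `_average` row is consistent with the plain unweighted mean of the nine subtask scores (values copied from the table above):

```python
# Unweighted mean of the nine AlGhafa subtask acc_norm scores; the
# community:alghafa:_average row should match this value.
subtask_scores = {
    "mcq_exams_test_ar": 0.5135,
    "meta_ar_dialects": 0.7068,
    "meta_ar_msa": 0.8235,
    "multiple_choice_facts_truefalse_balanced_task": 0.8933,
    "multiple_choice_grounded_statement_soqal_task": 0.8867,
    "multiple_choice_grounded_statement_xglue_mlqa_task": 0.8133,
    "multiple_choice_rating_sentiment_no_neutral_task": 0.9071,
    "multiple_choice_rating_sentiment_task": 0.6027,
    "multiple_choice_sentiment_task": 0.3343,
}
average = sum(subtask_scores.values()) / len(subtask_scores)
print(round(average, 4))  # 0.7201
```

So the 0.7201 average is internally consistent, which makes the 33.99 leaderboard figure look like a scoring or aggregation issue rather than a bad run.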

[Attached screenshot of the leaderboard scores: Screenshot 2025-02-12 at 22.54.47.png]

Open Arabic LLM Leaderboard org
edited 2 days ago

Hey @karimouda and thanks for the kind words.
Yes, as outlined in this line of the blog, we identified a silent bug in the AlGhafa task implementation. We haven't merged the PR that fixes the issue into the official lighteval repo yet, so it would be better to reproduce the results using this fork instead.
Nevertheless, we will try to run it again internally to rule out any potential mistakes.
Thank you.

alielfilali01 changed discussion status to closed
