SILMA's AlGhafa score is not correct

#16
by karimouda - opened

Congratulations on the new version; it is definitely a step in the right direction.

I have noticed that the AlGhafa score for silma-ai/SILMA-9B-Instruct-v1.0 is extremely low (33.99), while in the previous version it was 71.85.

Today, I ran Lighteval on the AlGhafa tasks (examples/tasks/OALL_v2_tasks.txt) to double-check and got the results below. Could you please investigate this issue?
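For reference, the evaluation was run with a command along these lines. This is a sketch only: the exact flag names and the custom-tasks path (`community_tasks/arabic_evals.py`) are assumptions and may differ between lighteval versions.

```shell
# Hypothetical invocation of lighteval's accelerate backend on the OALL v2
# task list; adjust flags to match your installed lighteval version.
lighteval accelerate \
    --model_args "pretrained=silma-ai/SILMA-9B-Instruct-v1.0" \
    --tasks examples/tasks/OALL_v2_tasks.txt \
    --custom_tasks community_tasks/arabic_evals.py \
    --output_dir ./results
```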

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| all | | acc_norm | 0.7201 | ± 0.0172 |
| community:alghafa:_average:0 | | acc_norm | 0.7201 | ± 0.0172 |
| community:alghafa:mcq_exams_test_ar:0 | 0 | acc_norm | 0.5135 | ± 0.0212 |
| community:alghafa:meta_ar_dialects:0 | 0 | acc_norm | 0.7068 | ± 0.0062 |
| community:alghafa:meta_ar_msa:0 | 0 | acc_norm | 0.8235 | ± 0.0128 |
| community:alghafa:multiple_choice_facts_truefalse_balanced_task:0 | 0 | acc_norm | 0.8933 | ± 0.0359 |
| community:alghafa:multiple_choice_grounded_statement_soqal_task:0 | 0 | acc_norm | 0.8867 | ± 0.0260 |
| community:alghafa:multiple_choice_grounded_statement_xglue_mlqa_task:0 | 0 | acc_norm | 0.8133 | ± 0.0319 |
| community:alghafa:multiple_choice_rating_sentiment_no_neutral_task:0 | 0 | acc_norm | 0.9071 | ± 0.0032 |
| community:alghafa:multiple_choice_rating_sentiment_task:0 | 0 | acc_norm | 0.6027 | ± 0.0063 |
| community:alghafa:multiple_choice_sentiment_task:0 | 0 | acc_norm | 0.3343 | ± 0.0114 |
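As a sanity check on the run itself, the reported `_average` row is consistent with the plain unweighted mean of the nine subtask scores (values copied from the table above):

```python
# Unweighted mean of the nine AlGhafa subtask acc_norm scores; the
# community:alghafa:_average row should match this value.
subtask_scores = {
    "mcq_exams_test_ar": 0.5135,
    "meta_ar_dialects": 0.7068,
    "meta_ar_msa": 0.8235,
    "multiple_choice_facts_truefalse_balanced_task": 0.8933,
    "multiple_choice_grounded_statement_soqal_task": 0.8867,
    "multiple_choice_grounded_statement_xglue_mlqa_task": 0.8133,
    "multiple_choice_rating_sentiment_no_neutral_task": 0.9071,
    "multiple_choice_rating_sentiment_task": 0.6027,
    "multiple_choice_sentiment_task": 0.3343,
}
average = sum(subtask_scores.values()) / len(subtask_scores)
print(round(average, 4))  # 0.7201
```

So the 0.7201 average is internally consistent, which makes the 33.99 leaderboard figure look like a scoring or aggregation issue rather than a bad run.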

[Attached screenshot of the leaderboard scores: Screenshot 2025-02-12 at 22.54.47.png]

Open Arabic LLM Leaderboard org
edited 2 days ago

Hey @karimouda and thanks for the kind words.
Yes, as outlined in this line of the blog, we identified a silent bug in the AlGhafa task implementation. We haven't merged the PR that fixes the issue into the official lighteval repo yet, so it would be better to reproduce the results using this fork instead.
Nevertheless, we will try to run it again internally to rule out any potential mistakes.
Thank you.

alielfilali01 changed discussion status to closed
