SILMA's AlGhafa score is not correct
Congratulations on the new version; it is definitely a step in the right direction.
I have noticed that the AlGhafa score for silma-ai/SILMA-9B-Instruct-v1.0 is extremely low (33.99), whereas in the previous version it was 71.85.
Today I ran Lighteval on the AlGhafa tasks (examples/tasks/OALL_v2_tasks.txt) to double-check and got the results below. Could you please investigate this issue?
Task | Version | Metric | Value | | Stderr |
---|---|---|---|---|---|
all | | acc_norm | 0.7201 | ± | 0.0172 |
community:alghafa:_average:0 | | acc_norm | 0.7201 | ± | 0.0172 |
community:alghafa:mcq_exams_test_ar:0 | 0 | acc_norm | 0.5135 | ± | 0.0212 |
community:alghafa:meta_ar_dialects:0 | 0 | acc_norm | 0.7068 | ± | 0.0062 |
community:alghafa:meta_ar_msa:0 | 0 | acc_norm | 0.8235 | ± | 0.0128 |
community:alghafa:multiple_choice_facts_truefalse_balanced_task:0 | 0 | acc_norm | 0.8933 | ± | 0.0359 |
community:alghafa:multiple_choice_grounded_statement_soqal_task:0 | 0 | acc_norm | 0.8867 | ± | 0.0260 |
community:alghafa:multiple_choice_grounded_statement_xglue_mlqa_task:0 | 0 | acc_norm | 0.8133 | ± | 0.0319 |
community:alghafa:multiple_choice_rating_sentiment_no_neutral_task:0 | 0 | acc_norm | 0.9071 | ± | 0.0032 |
community:alghafa:multiple_choice_rating_sentiment_task:0 | 0 | acc_norm | 0.6027 | ± | 0.0063 |
community:alghafa:multiple_choice_sentiment_task:0 | 0 | acc_norm | 0.3343 | ± | 0.0114 |
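
As a quick sanity check on the table above, the sketch below recomputes the `_average` row from the nine subtask scores. It assumes the average is a plain unweighted mean over subtasks (not weighted by sample count); the variable names are my own and not part of Lighteval.

```python
# acc_norm values copied from the results table above.
alghafa_acc_norm = {
    "mcq_exams_test_ar": 0.5135,
    "meta_ar_dialects": 0.7068,
    "meta_ar_msa": 0.8235,
    "multiple_choice_facts_truefalse_balanced_task": 0.8933,
    "multiple_choice_grounded_statement_soqal_task": 0.8867,
    "multiple_choice_grounded_statement_xglue_mlqa_task": 0.8133,
    "multiple_choice_rating_sentiment_no_neutral_task": 0.9071,
    "multiple_choice_rating_sentiment_task": 0.6027,
    "multiple_choice_sentiment_task": 0.3343,
}

# Assumption: the reported average is an unweighted (macro) mean over subtasks.
macro_avg = sum(alghafa_acc_norm.values()) / len(alghafa_acc_norm)

# Leaderboard scores are shown as percentages, so scale for comparison.
print(f"macro acc_norm: {macro_avg:.4f} ({macro_avg * 100:.2f}%)")
```

Under that assumption the mean agrees with the reported 0.7201, i.e. roughly 72%, which is close to the earlier 71.85 and far from 33.99.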
Hey @karimouda, and thanks for the kind words.
Yes, and as outlined in this line of the blog, we identified a silent bug in the AlGhafa task implementation. We haven't yet merged the PR that fixes the issue into the official lighteval repo, so it would be better to reproduce the results using this fork instead.
Nevertheless, we will try to run it internally again to rule out any potential mistakes.
Thank you.