Reproducing Evaluation with lighteval

#1
by PatrickHaller - opened

Hey!

For reproducibility's sake, can you verify that this is the right task configuration for evaluating with lighteval? (A tasks-file sketch follows the list.)

helm|hellaswag|0|0
lighteval|arc:easy|0|0
leaderboard|arc:challenge|0|0
helm|mmlu|0|0
helm|piqa|0|0
helm|commonsenseqa|0|0
lighteval|triviaqa|0|0
leaderboard|winogrande|0|0
lighteval|openbookqa|0|0
leaderboard|gsm8k|5|0
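
For context, here is a minimal sketch of how these specs could be collected into a plain-text tasks file, assuming lighteval's `suite|task|num_fewshot|truncate_few_shots` spec format; the file name `tasks.txt` and the idea of passing a tasks file are assumptions about the setup, not a confirmed command line:

```python
# Minimal sketch: collect the task specs above into a plain-text tasks file.
# Assumptions: lighteval's "suite|task|num_fewshot|truncate_few_shots" spec
# format (as in the list above) and an arbitrary file name "tasks.txt".
tasks = [
    "helm|hellaswag|0|0",
    "lighteval|arc:easy|0|0",
    "leaderboard|arc:challenge|0|0",
    "helm|mmlu|0|0",
    "helm|piqa|0|0",
    "helm|commonsenseqa|0|0",
    "lighteval|triviaqa|0|0",
    "leaderboard|winogrande|0|0",
    "lighteval|openbookqa|0|0",
    "leaderboard|gsm8k|5|0",
]

with open("tasks.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tasks) + "\n")
```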

Furthermore:

  • Did you manually average the accuracies of ARC-Easy and ARC-Challenge to get the reported ARC score? (Worked example just after these questions.)
  • Which metrics did you report? Is it accuracy everywhere?
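
To make the first question concrete, here is a small worked example of the two obvious ways to combine the ARC splits, using the `acc` values from my results table further down; the test-split sizes in the weighted variant are an assumption on my part:

```python
# Two ways an "ARC" average could be computed from the per-split accuracies
# reported in the table below (0.7016 for arc:easy, 0.3660 for arc:challenge).
acc_easy = 0.7016       # lighteval|arc:easy, acc
acc_challenge = 0.3660  # leaderboard|arc:challenge, acc

# 1) Simple unweighted mean of the two splits.
unweighted = (acc_easy + acc_challenge) / 2
print(f"unweighted mean:    {unweighted:.4f}")  # 0.5338

# 2) Mean weighted by test-split size. The sizes below are an assumption
#    (commonly cited ARC test splits: 2376 easy, 1172 challenge).
n_easy, n_challenge = 2376, 1172
weighted = (acc_easy * n_easy + acc_challenge * n_challenge) / (n_easy + n_challenge)
print(f"size-weighted mean: {weighted:.4f}")    # ~0.5907
```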

Greetings,
Patrick

It also seems that some numbers might be off (reported -> what I get):

  • Winogrande: 52.5 -> 54.62
  • GSM8K: 3.2 -> 0.32
  • PIQA: 71.3 -> 3.1 (em), 9.0 (qem), 3.8 (pem), 19.79 (pqem)
  • etc.

Some numbers seem wildly different from what you reported.

|                     Task                      |Version| Metric |Value |   |Stderr|
|-----------------------------------------------|------:|--------|-----:|---|-----:|
|all                                            |       |em      |0.1994|±  |0.0285|
|                                               |       |qem     |0.2052|±  |0.0282|
|                                               |       |pem     |0.2423|±  |0.0308|
|                                               |       |pqem    |0.4098|±  |0.0352|
|                                               |       |acc     |0.4650|±  |0.0142|
|                                               |       |acc_norm|0.4796|±  |0.0151|
|helm:commonsenseqa:0                           |      0|em      |0.1949|±  |0.0113|
|                                               |       |qem     |0.1974|±  |0.0114|
|                                               |       |pem     |0.1949|±  |0.0113|
|                                               |       |pqem    |0.3129|±  |0.0133|
|helm:hellaswag:0                               |      0|em      |0.2173|±  |0.0041|
|                                               |       |qem     |0.2404|±  |0.0043|
|                                               |       |pem     |0.2297|±  |0.0042|
|                                               |       |pqem    |0.3162|±  |0.0046|
|helm:mmlu:_average:0                           |       |em      |0.2021|±  |0.0297|
|                                               |       |qem     |0.2109|±  |0.0303|
|                                               |       |pem     |0.2469|±  |0.0321|
|                                               |       |pqem    |0.4168|±  |0.0366|
(individual MMLU subtask rows omitted)
|helm:piqa:0                                    |      0|em      |0.0311|±  |0.0025|
|                                               |       |qem     |0.0904|±  |0.0041|
|                                               |       |pem     |0.0386|±  |0.0027|
|                                               |       |pqem    |0.1979|±  |0.0057|
|leaderboard:arc:challenge:0                    |      0|acc     |0.3660|±  |0.0141|
|                                               |       |acc_norm|0.3848|±  |0.0142|
|leaderboard:gsm8k:5                            |      0|qem     |0.0030|±  |0.0015|
|leaderboard:winogrande:0                       |      0|acc     |0.5462|±  |0.0140|
|lighteval:arc:easy:0                           |      0|acc     |0.7016|±  |0.0094|
|                                               |       |acc_norm|0.6801|±  |0.0096|
|lighteval:openbookqa:0                         |      0|acc     |0.2460|±  |0.0193|
|                                               |       |acc_norm|0.3740|±  |0.0217|
|lighteval:triviaqa:0                           |      0|qem     |0.1699|±  |0.0028|
Hugging Face TB Research org (edited Nov 22):

Hi, you can find the evaluation details here: https://github.com/huggingface/smollm/blob/main/evaluation/README.md (MMLU cloze is currently missing; we'll add it soon)
