Score gap between leaderboard and local run
Hi, we are currently testing our model fangzhaoz/pearl7B_tuneonGSM8K. The public leaderboard score for the GSM8K task is "acc": 0.3995451099317665, while running lm-evaluation-harness locally we get an "exact_match" value of 0.7688 with a stderr of 0.0116.
We are wondering why there is such a large gap. Is it because we are looking at the wrong metric locally?
FYI, here's the command we use for local evaluation:
```
lm_eval --model hf --model_args pretrained=fangzhaoz/pearl7B_tuneonGSM8K --tasks gsm8k --device cuda:0 --batch_size 8
```
and here are the results it returns:
| Tasks | Version | Filter     | n-shot | Metric      | Value  |   | Stderr |
|-------|---------|------------|--------|-------------|--------|---|--------|
| gsm8k | 2       | get-answer | 5      | exact_match | 0.7688 | ± | 0.0116 |
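In case it's useful for debugging, here is a variant of the same command that also saves the raw results and per-sample outputs to disk, assuming the `--output_path` and `--log_samples` flags are supported by the harness version we are on (the output directory name is just ours):

```
lm_eval --model hf \
  --model_args pretrained=fangzhaoz/pearl7B_tuneonGSM8K \
  --tasks gsm8k \
  --device cuda:0 \
  --batch_size 8 \
  --output_path results/pearl7B_gsm8k \
  --log_samples
```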
Hi!
Did you follow the same steps as we did, notably using the same commit of the harness? (I think you used a more recent version of the harness.)
Everything is detailed on the About page.
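As a rough sketch, you can pin your local harness to the commit listed on the About page before re-running; the hash below is only a placeholder, not the actual commit:

```
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@<commit-hash-from-about-page>
```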
You can also take a look at the difference between your outputs and the details we save (accessible by clicking the page up icon next to the model name in the leaderboard).
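For example, something along these lines should fetch the saved details locally, assuming a recent huggingface_hub CLI and the usual details_<org>__<model> naming for the details dataset (double-check the exact name via the icon on the leaderboard):

```
huggingface-cli download open-llm-leaderboard/details_fangzhaoz__pearl7B_tuneonGSM8K \
  --repo-type dataset --local-dir pearl7B_gsm8k_details
```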