Score gap between leaderboard and local run
Hi, we are currently testing our model fangzhaoz/pearl7B_tuneonGSM8K. The public leaderboard score for the GSM8K task is "acc": 0.3995451099317665, while running lm-evaluation-harness locally we get an "exact_match" value of 0.7688 with a stderr of 0.0116.
We are wondering why there is such a large gap. Is it because we are looking at the wrong metric locally?
FYI, here's the command we use for local evaluation:
```
lm_eval --model hf --model_args pretrained=fangzhaoz/pearl7B_tuneonGSM8K --tasks gsm8k --device cuda:0 --batch_size 8
```
and here are the results it returns:
| Tasks | Version | Filter     | n-shot | Metric      | Value  |   | Stderr |
|-------|---------|------------|--------|-------------|--------|---|--------|
| gsm8k | 2       | get-answer | 5      | exact_match | 0.7688 | ± | 0.0116 |
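In case it's useful for debugging, here is a variant of the same command that also saves the raw results and per-sample outputs to disk, assuming the `--output_path` and `--log_samples` flags are supported by the harness version we are on (the output directory name is just ours):

```
lm_eval --model hf \
  --model_args pretrained=fangzhaoz/pearl7B_tuneonGSM8K \
  --tasks gsm8k \
  --device cuda:0 \
  --batch_size 8 \
  --output_path results/pearl7B_gsm8k \
  --log_samples
```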
Hi!
Did you follow the same steps as we did, notably using the same commit of the harness? (I think you used a more recent version of the harness.)
Everything is detailed on the About page.
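As a rough sketch, you can pin your local harness to the commit listed on the About page before re-running; the hash below is only a placeholder, not the actual commit:

```
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@<commit-hash-from-about-page>
```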
You can also take a look at the difference between your outputs and the details we save (accessible by clicking the page up icon next to the model name in the leaderboard).
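For example, something along these lines should fetch the saved details locally, assuming a recent huggingface_hub CLI and the usual details_<org>__<model> naming for the details dataset (double-check the exact name via the icon on the leaderboard):

```
huggingface-cli download open-llm-leaderboard/details_fangzhaoz__pearl7B_tuneonGSM8K \
  --repo-type dataset --local-dir pearl7B_gsm8k_details
```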