The lm-evaluation-harness results are different from the leaderboard results.

#659
by jisukim8873 - opened

When I run GSM8K and the other benchmarks with lm-evaluation-harness, the results are very different from the leaderboard results. Has anyone else experienced the same thing? Can anyone tell me what could be the cause?

Below are the commands I used:

lm_eval --model hf --model_args pretrained=$1 --tasks arc_challenge --device cuda:1 --num_fewshot 25 --batch_size 2 --output_path $2/arc
lm_eval --model hf --model_args pretrained=$1 --tasks hellaswag --device cuda:1 --num_fewshot 10 --batch_size 1 --output_path $2/hellaswag
lm_eval --model hf --model_args pretrained=$1 --tasks mmlu --device cuda:1 --num_fewshot 5 --batch_size 2 --output_path $2/mmlu
lm_eval --model hf --model_args pretrained=$1 --tasks truthfulqa --device cuda:1 --num_fewshot 0 --batch_size 2 --output_path $2/truthfulqa
lm_eval --model hf --model_args pretrained=$1 --tasks winogrande --device cuda:1 --num_fewshot 5 --batch_size 1 --output_path $2/winogrande
lm_eval --model hf --model_args pretrained=$1 --tasks gsm8k --device cuda:1 --num_fewshot 5 --batch_size 1 --output_path $2/gsm8k

Open LLM Leaderboard org

Hi!
Did you make sure to follow the steps for reproducibility in the About tab, and to use the same lm_eval commit as we do?
The way evaluations are computed has changed quite a lot in the harness over the last year.
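For reference, a minimal sketch of pinning the harness to the leaderboard's commit before rerunning; <leaderboard_commit> is a placeholder for the exact hash listed in the About tab:

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout <leaderboard_commit>  # placeholder: substitute the hash from the About tab
pip install -e .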

Thank you for the answer :)

I checked my lm_eval install and it is indeed on a different commit than the one the leaderboard uses, which explains the discrepancy.
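For anyone hitting the same thing, a quick way to check which harness you actually have installed (assuming a source checkout installed with pip install -e .):

cd lm-evaluation-harness && git log -1 --format=%H  # commit of the local checkout
pip show lm_eval                                    # reported package version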

jisukim8873 changed discussion status to closed