How can we reproduce the accuracy results?

#22, opened by damoict

I have been struggling to reproduce the accuracy results for most of the tasks, from GPQA to GSM8K to IFEval. I downloaded the repo to a folder on my local server and use the lm-eval-harness GitHub repo, which includes most of the tasks, launching each task with a command like the following (here for IFEval):

python -m lm_eval --model hf --model_args pretrained=/mnt/LLM_checkpoints/Meta-Llama-3.1-8B-Instruct/ --tasks ifeval --batch_size 16
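For what it's worth, instruct-tuned checkpoints are often sensitive to whether the harness applies the model's chat template, and the exact flags behind the posted numbers are not stated in this thread. The variant below is only a sketch of a more fully specified invocation one might try; the --apply_chat_template flag and dtype=bfloat16 setting are assumptions on my part, not confirmed settings from the model card.

# Assumption: the posted numbers may have been produced with the chat template applied; this is a sketch, not a confirmed recipe.
python -m lm_eval --model hf --model_args pretrained=/mnt/LLM_checkpoints/Meta-Llama-3.1-8B-Instruct/,dtype=bfloat16 --tasks ifeval --batch_size 16 --apply_chat_template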

However, the results I got are the following, and they do not match what this repo has posted. I also tested other tasks and could not reproduce their posted results either.

| Tasks  | Version | Filter | n-shot | Metric                  | Value  |   | Stderr |
|--------|---------|--------|--------|-------------------------|--------|---|--------|
| ifeval | 4       | none   | 0      | inst_level_loose_acc    | 0.6283 | ± | N/A    |
|        |         | none   | 0      | inst_level_strict_acc   | 0.5863 | ± | N/A    |
|        |         | none   | 0      | prompt_level_loose_acc  | 0.4972 | ± | 0.0215 |
|        |         | none   | 0      | prompt_level_strict_acc | 0.4455 | ± | 0.0214 |

Can anyone share how they reproduced the results? Thank you!
