How can we reproduce the accuracy results?
#22 opened by damoict
I have been struggling to reproduce the accuracy results for most of the tasks, from GPQA to GSM8K to IFEval. I downloaded this repo to a folder on my local server and used lm-eval-harness's GitHub repo, which includes most of the tasks, then launched each task with a command like the following (for example, for ifeval):
```
python -m lm_eval --model hf --model_args pretrained=/mnt/LLM_checkpoints/Meta-Llama-3.1-8B-Instruct/ --tasks ifeval --batch_size 16
```
However, the results I get are the following, and they do not match what this repo has posted. I have also tested other tasks and cannot reproduce their posted results either.
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| ifeval | 4 | none | 0 | inst_level_loose_acc | ↑ | 0.6283 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.5863 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.4972 | ± | 0.0215 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.4455 | ± | 0.0214 |
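One thing I am unsure about is whether the posted numbers assume the chat template for the Instruct model. If so, a run along these lines might be closer to the intended setup (just a sketch on my side, assuming the `--apply_chat_template` flag available in recent lm-eval-harness versions):

```
# Same run as above, but formatting prompts with the model's chat template
python -m lm_eval --model hf \
  --model_args pretrained=/mnt/LLM_checkpoints/Meta-Llama-3.1-8B-Instruct/ \
  --tasks ifeval \
  --batch_size 16 \
  --apply_chat_template
```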
Can anyone share how you reproduced these results? Thank you!