How can we reproduce the accuracy results?

#22, opened by damoict

I have been struggling to reproduce the accuracy results for most of the tasks, from GPQA to GSM8K to IFEval. I downloaded the repo to a folder on my local server and use the lm-eval-harness GitHub repo, which includes most of the tasks, launching each task with a command like the following (here for IFEval):

python -m lm_eval --model hf --model_args pretrained=/mnt/LLM_checkpoints/Meta-Llama-3.1-8B-Instruct/ --tasks ifeval --batch_size 16
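For what it's worth, instruct-tuned checkpoints are often sensitive to whether the harness applies the model's chat template, and the exact flags behind the posted numbers are not stated in this thread. The variant below is only a sketch of a more fully specified invocation one might try; the --apply_chat_template flag and dtype=bfloat16 setting are assumptions on my part, not confirmed settings from the model card.

# Assumption: the posted numbers may have been produced with the chat template applied; this is a sketch, not a confirmed recipe.
python -m lm_eval --model hf --model_args pretrained=/mnt/LLM_checkpoints/Meta-Llama-3.1-8B-Instruct/,dtype=bfloat16 --tasks ifeval --batch_size 16 --apply_chat_template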

However, the results I got are the following, and they do not match what this repo has posted. I also tested other tasks and could not reproduce their posted results either.

| Tasks  | Version | Filter | n-shot | Metric                  | Value  |   | Stderr |
|--------|---------|--------|--------|-------------------------|--------|---|--------|
| ifeval | 4       | none   | 0      | inst_level_loose_acc    | 0.6283 | ± | N/A    |
|        |         | none   | 0      | inst_level_strict_acc   | 0.5863 | ± | N/A    |
|        |         | none   | 0      | prompt_level_loose_acc  | 0.4972 | ± | 0.0215 |
|        |         | none   | 0      | prompt_level_strict_acc | 0.4455 | ± | 0.0214 |

Can anyone share how they reproduced the results? Thank you!
