Different Scores

#8
by bezir - opened

Hello,

I can't reproduce the scores for the google/gemma-2-9b-it model.

Run:
lm_eval --model vllm --model_args pretrained=google/gemma-2-9b-it --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2 --output /OUTPUT/PATH

lm_eval --model hf --model_args pretrained=google/gemma-2-9b-it --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2 --output /OUTPUT/PATH

2024-11-07:21:18:44,442 INFO [main.py:272] Verbosity set to INFO
2024-11-07:21:18:50,150 INFO [main.py:369] Selected Tasks: ['arc_tr-v0.2', 'gsm8k_tr-v0.2', 'hellaswag_tr-v0.2', 'mmlu_tr_v0.2', 'truthfulqa_v0.2', 'winogrande_tr-v0.2']
2024-11-07:21:18:50,153 INFO [evaluator.py:152] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-11-07:21:18:50,153 INFO [evaluator.py:189] Initializing hf model, with arguments: {'pretrained': 'google/gemma-2-9b-it'}
2024-11-07:21:18:51,690 INFO [huggingface.py:169] Using device 'cuda'

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.09it/s]
2024-11-07:21:21:35,015 WARNING [task.py:981] Task 'winogrande_tr-v0.2': num_fewshot > 0 but fewshot_split is None. using preconfigured rule.
2024-11-07:21:21:35,016 WARNING [task.py:981] Task 'winogrande_tr-v0.2': num_fewshot > 0 but fewshot_split is None. using preconfigured rule.
2024-11-07:21:21:35,139 INFO [evaluator.py:261] Setting fewshot random generator seed to 1234

.... Process Here ...

PATH/lm-evaluation-harness_turkish/lm_eval/api/task.py:1376: RuntimeWarning: divide by zero encountered in divide
pred_norm = np.argmax(lls / completion_len)
2024-11-08:08:29:36,646 INFO [evaluation_tracker.py:182] Saving results aggregated
hf (pretrained=google/gemma-2-9b-it), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
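
For context, the divide-by-zero RuntimeWarning above comes from the length-normalized scoring step (the `acc_norm` metric). Below is a minimal sketch of what `np.argmax(lls / completion_len)` computes; the variable values are made up for illustration and this is not the harness's actual code. A zero-length answer choice is what produces the warning.

```python
# Minimal sketch of the multiple-choice scoring behind the warning above.
# Values are illustrative only; this is not the harness's implementation.
import numpy as np

# Log-likelihood the model assigns to each answer choice.
lls = np.array([-12.3, -9.8, -15.1, -11.0])

# Length of each answer choice string, used for length normalization.
# A zero-length choice makes the division blow up and triggers the warning.
completion_len = np.array([24, 31, 0, 18])

pred = np.argmax(lls)                        # plain accuracy ("acc")
pred_norm = np.argmax(lls / completion_len)  # length-normalized accuracy ("acc_norm")
# -> RuntimeWarning: divide by zero encountered in divide

print(pred, pred_norm)
```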

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---:|---|---|---:|---|---:|
|arc_tr-v0.2|1|none|25|acc|↑|0.4744|±|0.0146|
| | |none|25|acc_norm|↑|0.5222|±|0.0146|
|gsm8k_tr-v0.2|Yaml|flexible-extract|5|exact_match|↑|0.1465|±|0.0097|
| | |strict-match|5|exact_match|↑|0.6295|±|0.0133|
|hellaswag_tr-v0.2|Yaml|none|10|acc|↑|0.3992|±|0.0052|
| | |none|10|acc_norm|↑|0.5145|±|0.0053|
| - abstract_algebra_v0.2|0|none|5|acc|↑|0.3500|±|0.0479|
| - anatomy_v0.2|0|none|5|acc|↑|0.5573|±|0.0436|
| - astronomy|0|none|0|acc|↑|0.3113|±|0.0378|
| - business_ethics_v0.2|0|none|5|acc|↑|0.5253|±|0.0504|
| - clinical_knowledge_v0.2|0|none|5|acc|↑|0.5703|±|0.0310|
| - college_biology_v0.2|0|none|5|acc|↑|0.6479|±|0.0402|
| - college_chemistry_v0.2|0|none|5|acc|↑|0.4646|±|0.0504|
| - college_computer_science_v0.2|0|none|5|acc|↑|0.4040|±|0.0496|
| - college_mathematics_v0.2|0|none|5|acc|↑|0.3600|±|0.0482|
| - college_medicine_v0.2|0|none|5|acc|↑|0.5000|±|0.0387|
| - college_physics_v0.2|0|none|5|acc|↑|0.4257|±|0.0494|
| - computer_security_v0.2|0|none|5|acc|↑|0.5200|±|0.0502|
| - conceptual_physics_v0.2|0|none|5|acc|↑|0.4635|±|0.0327|
| - econometrics_v0.2|0|none|5|acc|↑|0.3596|±|0.0451|
| - electrical_engineering_v0.2|0|none|5|acc|↑|0.5625|±|0.0415|
| - elementary_mathematics_v0.2|0|none|5|acc|↑|0.4504|±|0.0258|
| - formal_logic_v0.2|0|none|5|acc|↑|0.3889|±|0.0436|
| - global_facts_v0.2|0|none|5|acc|↑|0.3776|±|0.0492|
| - high_school_biology_v0.2|0|none|5|acc|↑|0.7100|±|0.0262|
| - high_school_chemistry_v0.2|0|none|5|acc|↑|0.5178|±|0.0357|
| - high_school_computer_science_v0.2|0|none|5|acc|↑|0.6600|±|0.0476|
| - high_school_european_history_v0.2|0|none|5|acc|↑|0.6333|±|0.0395|
| - high_school_geography_v0.2|0|none|5|acc|↑|0.6701|±|0.0336|
| - high_school_government_and_politics_v0.2|0|none|5|acc|↑|0.6203|±|0.0356|
| - high_school_macroeconomics_v0.2|0|none|5|acc|↑|0.4974|±|0.0254|
| - high_school_mathematics_v0.2|0|none|5|acc|↑|0.3296|±|0.0287|
| - high_school_microeconomics_v0.2|0|none|5|acc|↑|0.5316|±|0.0325|
| - high_school_physics_v0.2|0|none|5|acc|↑|0.3537|±|0.0396|
| - high_school_psychology_v0.2|0|none|5|acc|↑|0.6904|±|0.0200|
| - high_school_statistics_v0.2|0|none|5|acc|↑|0.3750|±|0.0330|
| - high_school_us_history_v0.2|0|none|5|acc|↑|0.6536|±|0.0357|
| - high_school_world_history_v0.2|0|none|5|acc|↑|0.7042|±|0.0313|
| - human_aging_v0.2|0|none|5|acc|↑|0.5708|±|0.0341|
| - human_sexuality_v0.2|0|none|5|acc|↑|0.6261|±|0.0453|
| - humanities_v0.2|N/A|none|5|acc|↑|0.4612|±|0.0071|
| - international_law_v0.2|0|none|5|acc|↑|0.6777|±|0.0427|
| - jurisprudence_v0.2|0|none|5|acc|↑|0.6132|±|0.0475|
| - logical_fallacies_v0.2|0|none|5|acc|↑|0.4658|±|0.0394|
| - machine_learning_v0.2|0|none|5|acc|↑|0.3839|±|0.0462|
| - management_v0.2|0|none|5|acc|↑|0.6566|±|0.0480|
| - marketing_v0.2|0|none|5|acc|↑|0.7143|±|0.0307|
| - medical_genetics_v0.2|0|none|5|acc|↑|0.6737|±|0.0484|
| - miscellaneous_v0.2|0|none|5|acc|↑|0.6789|±|0.0169|
| - moral_disputes_v0.2|0|none|5|acc|↑|0.6299|±|0.0276|
| - moral_scenarios_v0.2|0|none|5|acc|↑|0.2420|±|0.0145|
| - nutrition_v0.2|0|none|5|acc|↑|0.6000|±|0.0281|
| - other_v0.2|N/A|none|5|acc|↑|0.5773|±|0.0088|
| - philosophy_v0.2|0|none|5|acc|↑|0.5987|±|0.0284|
| - prehistory_v0.2|0|none|5|acc|↑|0.6000|±|0.0283|
| - professional_accounting_v0.2|0|none|5|acc|↑|0.3620|±|0.0288|
| - professional_law_v0.2|0|none|5|acc|↑|0.3710|±|0.0130|
| - professional_medicine_v0.2|0|none|5|acc|↑|0.5364|±|0.0309|
| - professional_psychology_v0.2|0|none|5|acc|↑|0.5118|±|0.0205|
| - public_relations_v0.2|0|none|5|acc|↑|0.5833|±|0.0477|
| - security_studies_v0.2|0|none|5|acc|↑|0.6667|±|0.0309|
| - social_sciences_v0.2|N/A|none|5|acc|↑|0.5917|±|0.0088|
| - sociology_v0.2|0|none|5|acc|↑|0.6718|±|0.0337|
| - stem_v0.2|N/A|none|5|acc|↑|0.4709|±|0.0087|
|mmlu_tr_v0.2|N/A|none|0|acc|↑|0.5183|±|0.0041|
| - us_foreign_policy_v0.2|0|none|5|acc|↑|0.7475|±|0.0439|
| - virology_v0.2|0|none|5|acc|↑|0.4528|±|0.0396|
| - world_religions_v0.2|0|none|5|acc|↑|0.6726|±|0.0363|
|truthfulqa_v0.2|Yaml|none|0|acc|↑|0.5296|±|0.0158|
|winogrande_tr-v0.2|Yaml|none|10|acc|↑|0.5624|±|0.0139|

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---:|---|---|---:|---|---:|
| - humanities_v0.2|N/A|none|5|acc|↑|0.4612|±|0.0071|
| - other_v0.2|N/A|none|5|acc|↑|0.5773|±|0.0088|
| - social_sciences_v0.2|N/A|none|5|acc|↑|0.5917|±|0.0088|
| - stem_v0.2|N/A|none|5|acc|↑|0.4709|±|0.0087|
|mmlu_tr_v0.2|N/A|none|0|acc|↑|0.5183|±|0.0041|

Thank you for bringing this to my attention. I will re-evaluate and post an update here.

The results are correct; I'm not exactly sure why you are getting different results.
|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|--------------------------------------------|-------|----------------|-----:|-----------|---|-----:|---|-----:|
|arc_tr-v0.2 | 1|none | 25|acc |↑ |0.5179|Β± |0.0146|
| | |none | 25|acc_norm |↑ |0.5648|Β± |0.0145|
|gsm8k_tr-v0.2 |Yaml |flexible-extract| 5|exact_match|↑ |0.1443|Β± |0.0097|
| | |strict-match | 5|exact_match|↑ |0.6302|Β± |0.0133|
|hellaswag_tr-v0.2 |Yaml |none | 10|acc |↑ |0.4356|Β± |0.0053|
| | |none | 10|acc_norm |↑ |0.5650|Β± |0.0053|
| - abstract_algebra_v0.2 | 0|none | 5|acc |↑ |0.3500|Β± |0.0479|
| - anatomy_v0.2 | 0|none | 5|acc |↑ |0.6565|Β± |0.0416|
| - astronomy | 0|none | 0|acc |↑ |0.8158|Β± |0.0315|
| - business_ethics_v0.2 | 0|none | 5|acc |↑ |0.6465|Β± |0.0483|
| - clinical_knowledge_v0.2 | 0|none | 5|acc |↑ |0.6758|Β± |0.0293|
| - college_biology_v0.2 | 0|none | 5|acc |↑ |0.7254|Β± |0.0376|
| - college_chemistry_v0.2 | 0|none | 5|acc |↑ |0.5556|Β± |0.0502|
| - college_computer_science_v0.2 | 0|none | 5|acc |↑ |0.4444|Β± |0.0502|
| - college_mathematics_v0.2 | 0|none | 5|acc |↑ |0.4000|Β± |0.0492|
| - college_medicine_v0.2 | 0|none | 5|acc |↑ |0.6250|Β± |0.0375|
| - college_physics_v0.2 | 0|none | 5|acc |↑ |0.4752|Β± |0.0499|
| - computer_security_v0.2 | 0|none | 5|acc |↑ |0.6500|Β± |0.0479|
| - conceptual_physics_v0.2 | 0|none | 5|acc |↑ |0.5966|Β± |0.0322|
| - econometrics_v0.2 | 0|none | 5|acc |↑ |0.5351|Β± |0.0469|
| - electrical_engineering_v0.2 | 0|none | 5|acc |↑ |0.6042|Β± |0.0409|
| - elementary_mathematics_v0.2 | 0|none | 5|acc |↑ |0.5416|Β± |0.0258|
| - formal_logic_v0.2 | 0|none | 5|acc |↑ |0.4683|Β± |0.0446|
| - global_facts_v0.2 | 0|none | 5|acc |↑ |0.3980|Β± |0.0497|
| - high_school_biology_v0.2 | 0|none | 5|acc |↑ |0.8100|Β± |0.0227|
| - high_school_chemistry_v0.2 | 0|none | 5|acc |↑ |0.6244|Β± |0.0346|
| - high_school_computer_science_v0.2 | 0|none | 5|acc |↑ |0.7000|Β± |0.0461|
| - high_school_european_history_v0.2 | 0|none | 5|acc |↑ |0.7067|Β± |0.0373|
| - high_school_geography_v0.2 | 0|none | 5|acc |↑ |0.7310|Β± |0.0317|
| - high_school_government_and_politics_v0.2| 0|none | 5|acc |↑ |0.7112|Β± |0.0332|
| - high_school_macroeconomics_v0.2 | 0|none | 5|acc |↑ |0.6795|Β± |0.0237|
| - high_school_mathematics_v0.2 | 0|none | 5|acc |↑ |0.3704|Β± |0.0294|
| - high_school_microeconomics_v0.2 | 0|none | 5|acc |↑ |0.6245|Β± |0.0315|
| - high_school_physics_v0.2 | 0|none | 5|acc |↑ |0.4694|Β± |0.0413|
| - high_school_psychology_v0.2 | 0|none | 5|acc |↑ |0.7917|Β± |0.0176|
| - high_school_statistics_v0.2 | 0|none | 5|acc |↑ |0.4907|Β± |0.0341|
| - high_school_us_history_v0.2 | 0|none | 5|acc |↑ |0.7877|Β± |0.0307|
| - high_school_world_history_v0.2 | 0|none | 5|acc |↑ |0.7606|Β± |0.0293|
| - human_aging_v0.2 | 0|none | 5|acc |↑ |0.6887|Β± |0.0319|
| - human_sexuality_v0.2 | 0|none | 5|acc |↑ |0.6870|Β± |0.0434|
| - humanities_v0.2 |N/A |none | 5|acc |↑ |0.5448|Β± |0.0071|
| - international_law_v0.2 | 0|none | 5|acc |↑ |0.7769|Β± |0.0380|
| - jurisprudence_v0.2 | 0|none | 5|acc |↑ |0.7736|Β± |0.0408|
| - logical_fallacies_v0.2 | 0|none | 5|acc |↑ |0.6460|Β± |0.0378|
| - machine_learning_v0.2 | 0|none | 5|acc |↑ |0.4732|Β± |0.0474|
| - management_v0.2 | 0|none | 5|acc |↑ |0.7778|Β± |0.0420|
| - marketing_v0.2 | 0|none | 5|acc |↑ |0.8065|Β± |0.0269|
| - medical_genetics_v0.2 | 0|none | 5|acc |↑ |0.7263|Β± |0.0460|
| - miscellaneous_v0.2 | 0|none | 5|acc |↑ |0.7676|Β± |0.0153|
| - moral_disputes_v0.2 | 0|none | 5|acc |↑ |0.7078|Β± |0.0260|
| - moral_scenarios_v0.2 | 0|none | 5|acc |↑ |0.3372|Β± |0.0160|
| - nutrition_v0.2 | 0|none | 5|acc |↑ |0.6852|Β± |0.0266|
| - other_v0.2 |N/A |none | 5|acc |↑ |0.6642|Β± |0.0083|
| - philosophy_v0.2 | 0|none | 5|acc |↑ |0.6555|Β± |0.0275|
| - prehistory_v0.2 | 0|none | 5|acc |↑ |0.7167|Β± |0.0261|
| - professional_accounting_v0.2 | 0|none | 5|acc |↑ |0.3943|Β± |0.0293|
| - professional_law_v0.2 | 0|none | 5|acc |↑ |0.4294|Β± |0.0133|
| - professional_medicine_v0.2 | 0|none | 5|acc |↑ |0.6475|Β± |0.0296|
| - professional_psychology_v0.2 | 0|none | 5|acc |↑ |0.6212|Β± |0.0199|
| - public_relations_v0.2 | 0|none | 5|acc |↑ |0.6759|Β± |0.0452|
| - security_studies_v0.2 | 0|none | 5|acc |↑ |0.7308|Β± |0.0291|
| - social_sciences_v0.2 |N/A |none | 5|acc |↑ |0.6970|Β± |0.0083|
| - sociology_v0.2 | 0|none | 5|acc |↑ |0.7692|Β± |0.0302|
| - stem |N/A |none | 5|acc |↑ |0.5751|Β± |0.0085|
|mmlu_tr_v0.2 |N/A |none | 0|acc |↑ |0.6122|Β± |0.0040|
| - us_foreign_policy_v0.2 | 0|none | 5|acc |↑ |0.7879|Β± |0.0413|
| - virology_v0.2 | 0|none | 5|acc |↑ |0.4906|Β± |0.0398|
| - world_religions_v0.2 | 0|none | 5|acc |↑ |0.7440|Β± |0.0338|
|truthfulqa_v0.2 |Yaml |none | 0|acc |↑ |0.5584|Β± |0.0158|
|winogrande_tr-v0.2 |Yaml |none | 10|acc |↑ |0.6201|Β± |0.0136|

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---:|---|---|---:|---|---:|
| - humanities_v0.2|N/A|none|5|acc|↑|0.5448|±|0.0071|
| - other_v0.2|N/A|none|5|acc|↑|0.6642|±|0.0083|
| - social_sciences_v0.2|N/A|none|5|acc|↑|0.6970|±|0.0083|
| - stem|N/A|none|5|acc|↑|0.5751|±|0.0085|
|mmlu_tr_v0.2|N/A|none|0|acc|↑|0.6122|±|0.0040|

Execution command: lm_eval --model vllm --model_args "pretrained=google/gemma-2-9b-it,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.6,data_parallel_size=4,max_model_len=3048" --tasks "mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2" --output_path "/home/malhajar" --batch_size=auto
I am using tensor parallelism for the evaluation.

malhajar changed discussion status to closed

@malhajar Hello,

Thanks for re-evaluating. I think the difference is because I used --model hf instead of vllm, and that is probably what causes the discrepancy. I also realized that Hugging Face does not use an inference engine in their leaderboard either, so this should be taken into account.
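
For reference, the hf backend scores each choice with plain forward passes through transformers (no inference engine in between), while the vllm backend goes through vLLM's kernels and batching, so small numeric differences from dtype, batching, or kernel implementations are plausible. Below is a simplified sketch of that kind of forward-pass loglikelihood scoring, written against the public transformers API; it is not the harness's actual implementation, and the dtype/loading choices and the clean context/continuation token boundary are assumptions made for illustration.

```python
# Simplified sketch of scoring an answer continuation with direct forward passes,
# roughly what the "hf" backend does. Not the harness's code; dtype and loading
# options below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def loglikelihood(context: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `context`.
    Assumes the tokenizer splits cleanly at the context/continuation boundary."""
    n_ctx = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits.float()
    # Log-prob of each continuation token, predicted from the previous position.
    logprobs = torch.log_softmax(logits[0, n_ctx - 1:-1], dim=-1)
    cont_ids = full_ids[0, n_ctx:]
    idx = torch.arange(cont_ids.shape[0], device=cont_ids.device)
    return logprobs[idx, cont_ids].sum().item()

# Example: which choice does the model prefer?
choices = [" Paris", " Berlin"]
scores = [loglikelihood("The capital of France is", c) for c in choices]
print(scores, "-> predicted choice:", choices[scores.index(max(scores))])
```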
