Different Scores
Hello,
I can't reproduce the same scores for the google/gemma-2-9b-it model.
Commands run:
lm_eval --model vllm --model_args pretrained=google/gemma-2-9b-it --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2 --output /OUTPUT/PATH
lm_eval --model hf --model_args pretrained=google/gemma-2-9b-it --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2 --output /OUTPUT/PATH
2024-11-07:21:18:44,442 INFO [main.py:272] Verbosity set to INFO
2024-11-07:21:18:50,150 INFO [main.py:369] Selected Tasks: ['arc_tr-v0.2', 'gsm8k_tr-v0.2', 'hellaswag_tr-v0.2', 'mmlu_tr_v0.2', 'truthfulqa_v0.2', 'winogrande_tr-v0.2']
2024-11-07:21:18:50,153 INFO [evaluator.py:152] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-11-07:21:18:50,153 INFO [evaluator.py:189] Initializing hf model, with arguments: {'pretrained': 'google/gemma-2-9b-it'}
2024-11-07:21:18:51,690 INFO [huggingface.py:169] Using device 'cuda'
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.09it/s]
2024-11-07:21:21:35,015 WARNING [task.py:981] Task 'winogrande_tr-v0.2': num_fewshot > 0 but fewshot_split is None. using preconfigured rule.
2024-11-07:21:21:35,016 WARNING [task.py:981] Task 'winogrande_tr-v0.2': num_fewshot > 0 but fewshot_split is None. using preconfigured rule.
2024-11-07:21:21:35,139 INFO [evaluator.py:261] Setting fewshot random generator seed to 1234
... (evaluation runs here; log truncated) ...
PATH/lm-evaluation-harness_turkish/lm_eval/api/task.py:1376: RuntimeWarning: divide by zero encountered in divide
pred_norm = np.argmax(lls / completion_len)
2024-11-08:08:29:36,646 INFO [evaluation_tracker.py:182] Saving results aggregated
hf (pretrained=google/gemma-2-9b-it), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

|Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---:|---|---|---:|---|---:|
|arc_tr-v0.2 | 1 | none | 25 | acc | ↑ | 0.4744 | ± | 0.0146 |
| | | none | 25 | acc_norm | ↑ | 0.5222 | ± | 0.0146 |
|gsm8k_tr-v0.2 | Yaml | flexible-extract | 5 | exact_match | ↑ | 0.1465 | ± | 0.0097 |
| | | strict-match | 5 | exact_match | ↑ | 0.6295 | ± | 0.0133 |
|hellaswag_tr-v0.2 | Yaml | none | 10 | acc | ↑ | 0.3992 | ± | 0.0052 |
| | | none | 10 | acc_norm | ↑ | 0.5145 | ± | 0.0053 |
| - abstract_algebra_v0.2 | 0 | none | 5 | acc | ↑ | 0.3500 | ± | 0.0479 |
| - anatomy_v0.2 | 0 | none | 5 | acc | ↑ | 0.5573 | ± | 0.0436 |
| - astronomy | 0 | none | 0 | acc | ↑ | 0.3113 | ± | 0.0378 |
| - business_ethics_v0.2 | 0 | none | 5 | acc | ↑ | 0.5253 | ± | 0.0504 |
| - clinical_knowledge_v0.2 | 0 | none | 5 | acc | ↑ | 0.5703 | ± | 0.0310 |
| - college_biology_v0.2 | 0 | none | 5 | acc | ↑ | 0.6479 | ± | 0.0402 |
| - college_chemistry_v0.2 | 0 | none | 5 | acc | ↑ | 0.4646 | ± | 0.0504 |
| - college_computer_science_v0.2 | 0 | none | 5 | acc | ↑ | 0.4040 | ± | 0.0496 |
| - college_mathematics_v0.2 | 0 | none | 5 | acc | ↑ | 0.3600 | ± | 0.0482 |
| - college_medicine_v0.2 | 0 | none | 5 | acc | ↑ | 0.5000 | ± | 0.0387 |
| - college_physics_v0.2 | 0 | none | 5 | acc | ↑ | 0.4257 | ± | 0.0494 |
| - computer_security_v0.2 | 0 | none | 5 | acc | ↑ | 0.5200 | ± | 0.0502 |
| - conceptual_physics_v0.2 | 0 | none | 5 | acc | ↑ | 0.4635 | ± | 0.0327 |
| - econometrics_v0.2 | 0 | none | 5 | acc | ↑ | 0.3596 | ± | 0.0451 |
| - electrical_engineering_v0.2 | 0 | none | 5 | acc | ↑ | 0.5625 | ± | 0.0415 |
| - elementary_mathematics_v0.2 | 0 | none | 5 | acc | ↑ | 0.4504 | ± | 0.0258 |
| - formal_logic_v0.2 | 0 | none | 5 | acc | ↑ | 0.3889 | ± | 0.0436 |
| - global_facts_v0.2 | 0 | none | 5 | acc | ↑ | 0.3776 | ± | 0.0492 |
| - high_school_biology_v0.2 | 0 | none | 5 | acc | ↑ | 0.7100 | ± | 0.0262 |
| - high_school_chemistry_v0.2 | 0 | none | 5 | acc | ↑ | 0.5178 | ± | 0.0357 |
| - high_school_computer_science_v0.2 | 0 | none | 5 | acc | ↑ | 0.6600 | ± | 0.0476 |
| - high_school_european_history_v0.2 | 0 | none | 5 | acc | ↑ | 0.6333 | ± | 0.0395 |
| - high_school_geography_v0.2 | 0 | none | 5 | acc | ↑ | 0.6701 | ± | 0.0336 |
| - high_school_government_and_politics_v0.2 | 0 | none | 5 | acc | ↑ | 0.6203 | ± | 0.0356 |
| - high_school_macroeconomics_v0.2 | 0 | none | 5 | acc | ↑ | 0.4974 | ± | 0.0254 |
| - high_school_mathematics_v0.2 | 0 | none | 5 | acc | ↑ | 0.3296 | ± | 0.0287 |
| - high_school_microeconomics_v0.2 | 0 | none | 5 | acc | ↑ | 0.5316 | ± | 0.0325 |
| - high_school_physics_v0.2 | 0 | none | 5 | acc | ↑ | 0.3537 | ± | 0.0396 |
| - high_school_psychology_v0.2 | 0 | none | 5 | acc | ↑ | 0.6904 | ± | 0.0200 |
| - high_school_statistics_v0.2 | 0 | none | 5 | acc | ↑ | 0.3750 | ± | 0.0330 |
| - high_school_us_history_v0.2 | 0 | none | 5 | acc | ↑ | 0.6536 | ± | 0.0357 |
| - high_school_world_history_v0.2 | 0 | none | 5 | acc | ↑ | 0.7042 | ± | 0.0313 |
| - human_aging_v0.2 | 0 | none | 5 | acc | ↑ | 0.5708 | ± | 0.0341 |
| - human_sexuality_v0.2 | 0 | none | 5 | acc | ↑ | 0.6261 | ± | 0.0453 |
| - humanities_v0.2 | N/A | none | 5 | acc | ↑ | 0.4612 | ± | 0.0071 |
| - international_law_v0.2 | 0 | none | 5 | acc | ↑ | 0.6777 | ± | 0.0427 |
| - jurisprudence_v0.2 | 0 | none | 5 | acc | ↑ | 0.6132 | ± | 0.0475 |
| - logical_fallacies_v0.2 | 0 | none | 5 | acc | ↑ | 0.4658 | ± | 0.0394 |
| - machine_learning_v0.2 | 0 | none | 5 | acc | ↑ | 0.3839 | ± | 0.0462 |
| - management_v0.2 | 0 | none | 5 | acc | ↑ | 0.6566 | ± | 0.0480 |
| - marketing_v0.2 | 0 | none | 5 | acc | ↑ | 0.7143 | ± | 0.0307 |
| - medical_genetics_v0.2 | 0 | none | 5 | acc | ↑ | 0.6737 | ± | 0.0484 |
| - miscellaneous_v0.2 | 0 | none | 5 | acc | ↑ | 0.6789 | ± | 0.0169 |
| - moral_disputes_v0.2 | 0 | none | 5 | acc | ↑ | 0.6299 | ± | 0.0276 |
| - moral_scenarios_v0.2 | 0 | none | 5 | acc | ↑ | 0.2420 | ± | 0.0145 |
| - nutrition_v0.2 | 0 | none | 5 | acc | ↑ | 0.6000 | ± | 0.0281 |
| - other_v0.2 | N/A | none | 5 | acc | ↑ | 0.5773 | ± | 0.0088 |
| - philosophy_v0.2 | 0 | none | 5 | acc | ↑ | 0.5987 | ± | 0.0284 |
| - prehistory_v0.2 | 0 | none | 5 | acc | ↑ | 0.6000 | ± | 0.0283 |
| - professional_accounting_v0.2 | 0 | none | 5 | acc | ↑ | 0.3620 | ± | 0.0288 |
| - professional_law_v0.2 | 0 | none | 5 | acc | ↑ | 0.3710 | ± | 0.0130 |
| - professional_medicine_v0.2 | 0 | none | 5 | acc | ↑ | 0.5364 | ± | 0.0309 |
| - professional_psychology_v0.2 | 0 | none | 5 | acc | ↑ | 0.5118 | ± | 0.0205 |
| - public_relations_v0.2 | 0 | none | 5 | acc | ↑ | 0.5833 | ± | 0.0477 |
| - security_studies_v0.2 | 0 | none | 5 | acc | ↑ | 0.6667 | ± | 0.0309 |
| - social_sciences_v0.2 | N/A | none | 5 | acc | ↑ | 0.5917 | ± | 0.0088 |
| - sociology_v0.2 | 0 | none | 5 | acc | ↑ | 0.6718 | ± | 0.0337 |
| - stem_v0.2 | N/A | none | 5 | acc | ↑ | 0.4709 | ± | 0.0087 |
|mmlu_tr_v0.2 | N/A | none | 0 | acc | ↑ | 0.5183 | ± | 0.0041 |
| - us_foreign_policy_v0.2 | 0 | none | 5 | acc | ↑ | 0.7475 | ± | 0.0439 |
| - virology_v0.2 | 0 | none | 5 | acc | ↑ | 0.4528 | ± | 0.0396 |
| - world_religions_v0.2 | 0 | none | 5 | acc | ↑ | 0.6726 | ± | 0.0363 |
|truthfulqa_v0.2 | Yaml | none | 0 | acc | ↑ | 0.5296 | ± | 0.0158 |
|winogrande_tr-v0.2 | Yaml | none | 10 | acc | ↑ | 0.5624 | ± | 0.0139 |

|Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---:|---|---|---:|---|---:|
| - humanities_v0.2 | N/A | none | 5 | acc | ↑ | 0.4612 | ± | 0.0071 |
| - other_v0.2 | N/A | none | 5 | acc | ↑ | 0.5773 | ± | 0.0088 |
| - social_sciences_v0.2 | N/A | none | 5 | acc | ↑ | 0.5917 | ± | 0.0088 |
| - stem_v0.2 | N/A | none | 5 | acc | ↑ | 0.4709 | ± | 0.0087 |
|mmlu_tr_v0.2 | N/A | none | 0 | acc | ↑ | 0.5183 | ± | 0.0041 |
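
Side note: the `RuntimeWarning: divide by zero` in the log above comes from the `lls / completion_len` expression when one of the answer completions tokenizes to zero length. A minimal numpy sketch with made-up values, only to illustrate the mechanism:

```python
import numpy as np

# Hypothetical per-choice log-likelihoods and completion token lengths.
lls = np.array([-12.3, -9.8])
completion_len = np.array([5, 0])  # a zero-length completion triggers the warning

# Same expression as in task.py; emits "divide by zero encountered in divide".
pred_norm = np.argmax(lls / completion_len)
print(pred_norm)  # 0, because -9.8 / 0 evaluates to -inf
```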
Thank you for bringing this to my attention. I will re-evaluate and post an update here.
Update: the results are correct on my end; I am not exactly sure why you are getting different results.

|Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|--------------------------------------------|-------|----------------|-----:|-----------|---|-----:|---|-----:|
|arc_tr-v0.2 | 1|none | 25|acc |↑ |0.5179|± |0.0146|
| | |none | 25|acc_norm |↑ |0.5648|± |0.0145|
|gsm8k_tr-v0.2 |Yaml |flexible-extract| 5|exact_match|↑ |0.1443|± |0.0097|
| | |strict-match | 5|exact_match|↑ |0.6302|± |0.0133|
|hellaswag_tr-v0.2 |Yaml |none | 10|acc |↑ |0.4356|± |0.0053|
| | |none | 10|acc_norm |↑ |0.5650|± |0.0053|
| - abstract_algebra_v0.2 | 0|none | 5|acc |↑ |0.3500|± |0.0479|
| - anatomy_v0.2 | 0|none | 5|acc |↑ |0.6565|± |0.0416|
| - astronomy | 0|none | 0|acc |↑ |0.8158|± |0.0315|
| - business_ethics_v0.2 | 0|none | 5|acc |↑ |0.6465|± |0.0483|
| - clinical_knowledge_v0.2 | 0|none | 5|acc |↑ |0.6758|± |0.0293|
| - college_biology_v0.2 | 0|none | 5|acc |↑ |0.7254|± |0.0376|
| - college_chemistry_v0.2 | 0|none | 5|acc |↑ |0.5556|± |0.0502|
| - college_computer_science_v0.2 | 0|none | 5|acc |↑ |0.4444|± |0.0502|
| - college_mathematics_v0.2 | 0|none | 5|acc |↑ |0.4000|± |0.0492|
| - college_medicine_v0.2 | 0|none | 5|acc |↑ |0.6250|± |0.0375|
| - college_physics_v0.2 | 0|none | 5|acc |↑ |0.4752|± |0.0499|
| - computer_security_v0.2 | 0|none | 5|acc |↑ |0.6500|± |0.0479|
| - conceptual_physics_v0.2 | 0|none | 5|acc |↑ |0.5966|± |0.0322|
| - econometrics_v0.2 | 0|none | 5|acc |↑ |0.5351|± |0.0469|
| - electrical_engineering_v0.2 | 0|none | 5|acc |↑ |0.6042|± |0.0409|
| - elementary_mathematics_v0.2 | 0|none | 5|acc |↑ |0.5416|± |0.0258|
| - formal_logic_v0.2 | 0|none | 5|acc |↑ |0.4683|± |0.0446|
| - global_facts_v0.2 | 0|none | 5|acc |↑ |0.3980|± |0.0497|
| - high_school_biology_v0.2 | 0|none | 5|acc |↑ |0.8100|± |0.0227|
| - high_school_chemistry_v0.2 | 0|none | 5|acc |↑ |0.6244|± |0.0346|
| - high_school_computer_science_v0.2 | 0|none | 5|acc |↑ |0.7000|± |0.0461|
| - high_school_european_history_v0.2 | 0|none | 5|acc |↑ |0.7067|± |0.0373|
| - high_school_geography_v0.2 | 0|none | 5|acc |↑ |0.7310|± |0.0317|
| - high_school_government_and_politics_v0.2| 0|none | 5|acc |↑ |0.7112|± |0.0332|
| - high_school_macroeconomics_v0.2 | 0|none | 5|acc |↑ |0.6795|± |0.0237|
| - high_school_mathematics_v0.2 | 0|none | 5|acc |↑ |0.3704|± |0.0294|
| - high_school_microeconomics_v0.2 | 0|none | 5|acc |↑ |0.6245|± |0.0315|
| - high_school_physics_v0.2 | 0|none | 5|acc |↑ |0.4694|± |0.0413|
| - high_school_psychology_v0.2 | 0|none | 5|acc |↑ |0.7917|± |0.0176|
| - high_school_statistics_v0.2 | 0|none | 5|acc |↑ |0.4907|± |0.0341|
| - high_school_us_history_v0.2 | 0|none | 5|acc |↑ |0.7877|± |0.0307|
| - high_school_world_history_v0.2 | 0|none | 5|acc |↑ |0.7606|± |0.0293|
| - human_aging_v0.2 | 0|none | 5|acc |↑ |0.6887|± |0.0319|
| - human_sexuality_v0.2 | 0|none | 5|acc |↑ |0.6870|± |0.0434|
| - humanities_v0.2 |N/A |none | 5|acc |↑ |0.5448|± |0.0071|
| - international_law_v0.2 | 0|none | 5|acc |↑ |0.7769|± |0.0380|
| - jurisprudence_v0.2 | 0|none | 5|acc |↑ |0.7736|± |0.0408|
| - logical_fallacies_v0.2 | 0|none | 5|acc |↑ |0.6460|± |0.0378|
| - machine_learning_v0.2 | 0|none | 5|acc |↑ |0.4732|± |0.0474|
| - management_v0.2 | 0|none | 5|acc |↑ |0.7778|± |0.0420|
| - marketing_v0.2 | 0|none | 5|acc |↑ |0.8065|± |0.0269|
| - medical_genetics_v0.2 | 0|none | 5|acc |↑ |0.7263|± |0.0460|
| - miscellaneous_v0.2 | 0|none | 5|acc |↑ |0.7676|± |0.0153|
| - moral_disputes_v0.2 | 0|none | 5|acc |↑ |0.7078|± |0.0260|
| - moral_scenarios_v0.2 | 0|none | 5|acc |↑ |0.3372|± |0.0160|
| - nutrition_v0.2 | 0|none | 5|acc |↑ |0.6852|± |0.0266|
| - other_v0.2 |N/A |none | 5|acc |↑ |0.6642|± |0.0083|
| - philosophy_v0.2 | 0|none | 5|acc |↑ |0.6555|± |0.0275|
| - prehistory_v0.2 | 0|none | 5|acc |↑ |0.7167|± |0.0261|
| - professional_accounting_v0.2 | 0|none | 5|acc |↑ |0.3943|± |0.0293|
| - professional_law_v0.2 | 0|none | 5|acc |↑ |0.4294|± |0.0133|
| - professional_medicine_v0.2 | 0|none | 5|acc |↑ |0.6475|± |0.0296|
| - professional_psychology_v0.2 | 0|none | 5|acc |↑ |0.6212|± |0.0199|
| - public_relations_v0.2 | 0|none | 5|acc |↑ |0.6759|± |0.0452|
| - security_studies_v0.2 | 0|none | 5|acc |↑ |0.7308|± |0.0291|
| - social_sciences_v0.2 |N/A |none | 5|acc |↑ |0.6970|± |0.0083|
| - sociology_v0.2 | 0|none | 5|acc |↑ |0.7692|± |0.0302|
| - stem |N/A |none | 5|acc |↑ |0.5751|± |0.0085|
|mmlu_tr_v0.2 |N/A |none | 0|acc |↑ |0.6122|± |0.0040|
| - us_foreign_policy_v0.2 | 0|none | 5|acc |↑ |0.7879|± |0.0413|
| - virology_v0.2 | 0|none | 5|acc |↑ |0.4906|± |0.0398|
| - world_religions_v0.2 | 0|none | 5|acc |↑ |0.7440|± |0.0338|
|truthfulqa_v0.2 |Yaml |none | 0|acc |↑ |0.5584|± |0.0158|
|winogrande_tr-v0.2 |Yaml |none | 10|acc |↑ |0.6201|± |0.0136|

|Groups |Version|Filter |n-shot|Metric | |Value | |Stderr|
|------------------------|-------|-------|-----:|-------|---|-----:|---|-----:|
| - humanities_v0.2 |N/A |none | 5|acc |↑ |0.5448|± |0.0071|
| - other_v0.2 |N/A |none | 5|acc |↑ |0.6642|± |0.0083|
| - social_sciences_v0.2 |N/A |none | 5|acc |↑ |0.6970|± |0.0083|
| - stem |N/A |none | 5|acc |↑ |0.5751|± |0.0085|
|mmlu_tr_v0.2 |N/A |none | 0|acc |↑ |0.6122|± |0.0040|
Execution command: lm_eval --model vllm --model_args "pretrained=google/gemma-2-9b-it,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.6,data_parallel_size=4,max_model_len=3048" --tasks "mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2" --output_path "/home/malhajar" --batch_size=auto
I am evaluating in parallel across 4 GPUs (tensor_parallel_size=1, data_parallel_size=4).
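For reference, a single-GPU version of the same run would look roughly like the sketch below (it simply drops data_parallel_size from the command above and keeps the other flags; the output path is a placeholder):
lm_eval --model vllm --model_args "pretrained=google/gemma-2-9b-it,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.6,max_model_len=3048" --tasks "mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2" --output_path /OUTPUT/PATH --batch_size auto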
@malhajar Hello,
Thanks for the re-evaluation. I think the difference is because I used --model hf instead of vllm, which is probably what causes the discrepancy. I also noticed that the Hugging Face leaderboard does not use an inference engine either; this should be taken into account.
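One thing I will also double-check (just a guess on my side) is whether both backends load the model in the same precision; with vLLM, dtype=auto resolves to the checkpoint's native bfloat16 weights. A sketch of the hf-backend command with the dtype pinned explicitly (output path is a placeholder):
lm_eval --model hf --model_args "pretrained=google/gemma-2-9b-it,dtype=bfloat16" --tasks "mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2" --output_path /OUTPUT/PATH --batch_size auto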