Different Scores

#8
by bezir - opened

Hello,

I can't reproduce the scores for the google/gemma-2-9b-it model.

Run:
lm_eval --model vllm --model_args pretrained=google/gemma-2-9b-it --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2 --output /OUTPUT/PATH

lm_eval --model hf --model_args pretrained=google/gemma-2-9b-it --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2 --output /OUTPUT/PATH

2024-11-07:21:18:44,442 INFO [main.py:272] Verbosity set to INFO
2024-11-07:21:18:50,150 INFO [main.py:369] Selected Tasks: ['arc_tr-v0.2', 'gsm8k_tr-v0.2', 'hellaswag_tr-v0.2', 'mmlu_tr_v0.2', 'truthfulqa_v0.2', 'winogrande_tr-v0.2']
2024-11-07:21:18:50,153 INFO [evaluator.py:152] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-11-07:21:18:50,153 INFO [evaluator.py:189] Initializing hf model, with arguments: {'pretrained': 'google/gemma-2-9b-it'}
2024-11-07:21:18:51,690 INFO [huggingface.py:169] Using device 'cuda'

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.09it/s]
2024-11-07:21:21:35,015 WARNING [task.py:981] Task 'winogrande_tr-v0.2': num_fewshot > 0 but fewshot_split is None. using preconfigured rule.
2024-11-07:21:21:35,016 WARNING [task.py:981] Task 'winogrande_tr-v0.2': num_fewshot > 0 but fewshot_split is None. using preconfigured rule.
2024-11-07:21:21:35,139 INFO [evaluator.py:261] Setting fewshot random generator seed to 1234

.... Process Here ...

PATH/lm-evaluation-harness_turkish/lm_eval/api/task.py:1376: RuntimeWarning: divide by zero encountered in divide
pred_norm = np.argmax(lls / completion_len)
2024-11-08:08:29:36,646 INFO [evaluation_tracker.py:182] Saving results aggregated
hf (pretrained=google/gemma-2-9b-it), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
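
For context, the divide-by-zero RuntimeWarning above comes from the length-normalized scoring step (the `acc_norm` metric). Below is a minimal sketch of what `np.argmax(lls / completion_len)` computes; the variable values are made up for illustration and this is not the harness's actual code. A zero-length answer choice is what produces the warning.

```python
# Minimal sketch of the multiple-choice scoring behind the warning above.
# Values are illustrative only; this is not the harness's implementation.
import numpy as np

# Log-likelihood the model assigns to each answer choice.
lls = np.array([-12.3, -9.8, -15.1, -11.0])

# Length of each answer choice string, used for length normalization.
# A zero-length choice makes the division blow up and triggers the warning.
completion_len = np.array([24, 31, 0, 18])

pred = np.argmax(lls)                        # plain accuracy ("acc")
pred_norm = np.argmax(lls / completion_len)  # length-normalized accuracy ("acc_norm")
# -> RuntimeWarning: divide by zero encountered in divide

print(pred, pred_norm)
```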

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---:|---|---|---:|---|---:|
|arc_tr-v0.2|1|none|25|acc|↑|0.4744|±|0.0146|
| | |none|25|acc_norm|↑|0.5222|±|0.0146|
|gsm8k_tr-v0.2|Yaml|flexible-extract|5|exact_match|↑|0.1465|±|0.0097|
| | |strict-match|5|exact_match|↑|0.6295|±|0.0133|
|hellaswag_tr-v0.2|Yaml|none|10|acc|↑|0.3992|±|0.0052|
| | |none|10|acc_norm|↑|0.5145|±|0.0053|
| - abstract_algebra_v0.2|0|none|5|acc|↑|0.3500|±|0.0479|
| - anatomy_v0.2|0|none|5|acc|↑|0.5573|±|0.0436|
| - astronomy|0|none|0|acc|↑|0.3113|±|0.0378|
| - business_ethics_v0.2|0|none|5|acc|↑|0.5253|±|0.0504|
| - clinical_knowledge_v0.2|0|none|5|acc|↑|0.5703|±|0.0310|
| - college_biology_v0.2|0|none|5|acc|↑|0.6479|±|0.0402|
| - college_chemistry_v0.2|0|none|5|acc|↑|0.4646|±|0.0504|
| - college_computer_science_v0.2|0|none|5|acc|↑|0.4040|±|0.0496|
| - college_mathematics_v0.2|0|none|5|acc|↑|0.3600|±|0.0482|
| - college_medicine_v0.2|0|none|5|acc|↑|0.5000|±|0.0387|
| - college_physics_v0.2|0|none|5|acc|↑|0.4257|±|0.0494|
| - computer_security_v0.2|0|none|5|acc|↑|0.5200|±|0.0502|
| - conceptual_physics_v0.2|0|none|5|acc|↑|0.4635|±|0.0327|
| - econometrics_v0.2|0|none|5|acc|↑|0.3596|±|0.0451|
| - electrical_engineering_v0.2|0|none|5|acc|↑|0.5625|±|0.0415|
| - elementary_mathematics_v0.2|0|none|5|acc|↑|0.4504|±|0.0258|
| - formal_logic_v0.2|0|none|5|acc|↑|0.3889|±|0.0436|
| - global_facts_v0.2|0|none|5|acc|↑|0.3776|±|0.0492|
| - high_school_biology_v0.2|0|none|5|acc|↑|0.7100|±|0.0262|
| - high_school_chemistry_v0.2|0|none|5|acc|↑|0.5178|±|0.0357|
| - high_school_computer_science_v0.2|0|none|5|acc|↑|0.6600|±|0.0476|
| - high_school_european_history_v0.2|0|none|5|acc|↑|0.6333|±|0.0395|
| - high_school_geography_v0.2|0|none|5|acc|↑|0.6701|±|0.0336|
| - high_school_government_and_politics_v0.2|0|none|5|acc|↑|0.6203|±|0.0356|
| - high_school_macroeconomics_v0.2|0|none|5|acc|↑|0.4974|±|0.0254|
| - high_school_mathematics_v0.2|0|none|5|acc|↑|0.3296|±|0.0287|
| - high_school_microeconomics_v0.2|0|none|5|acc|↑|0.5316|±|0.0325|
| - high_school_physics_v0.2|0|none|5|acc|↑|0.3537|±|0.0396|
| - high_school_psychology_v0.2|0|none|5|acc|↑|0.6904|±|0.0200|
| - high_school_statistics_v0.2|0|none|5|acc|↑|0.3750|±|0.0330|
| - high_school_us_history_v0.2|0|none|5|acc|↑|0.6536|±|0.0357|
| - high_school_world_history_v0.2|0|none|5|acc|↑|0.7042|±|0.0313|
| - human_aging_v0.2|0|none|5|acc|↑|0.5708|±|0.0341|
| - human_sexuality_v0.2|0|none|5|acc|↑|0.6261|±|0.0453|
| - humanities_v0.2|N/A|none|5|acc|↑|0.4612|±|0.0071|
| - international_law_v0.2|0|none|5|acc|↑|0.6777|±|0.0427|
| - jurisprudence_v0.2|0|none|5|acc|↑|0.6132|±|0.0475|
| - logical_fallacies_v0.2|0|none|5|acc|↑|0.4658|±|0.0394|
| - machine_learning_v0.2|0|none|5|acc|↑|0.3839|±|0.0462|
| - management_v0.2|0|none|5|acc|↑|0.6566|±|0.0480|
| - marketing_v0.2|0|none|5|acc|↑|0.7143|±|0.0307|
| - medical_genetics_v0.2|0|none|5|acc|↑|0.6737|±|0.0484|
| - miscellaneous_v0.2|0|none|5|acc|↑|0.6789|±|0.0169|
| - moral_disputes_v0.2|0|none|5|acc|↑|0.6299|±|0.0276|
| - moral_scenarios_v0.2|0|none|5|acc|↑|0.2420|±|0.0145|
| - nutrition_v0.2|0|none|5|acc|↑|0.6000|±|0.0281|
| - other_v0.2|N/A|none|5|acc|↑|0.5773|±|0.0088|
| - philosophy_v0.2|0|none|5|acc|↑|0.5987|±|0.0284|
| - prehistory_v0.2|0|none|5|acc|↑|0.6000|±|0.0283|
| - professional_accounting_v0.2|0|none|5|acc|↑|0.3620|±|0.0288|
| - professional_law_v0.2|0|none|5|acc|↑|0.3710|±|0.0130|
| - professional_medicine_v0.2|0|none|5|acc|↑|0.5364|±|0.0309|
| - professional_psychology_v0.2|0|none|5|acc|↑|0.5118|±|0.0205|
| - public_relations_v0.2|0|none|5|acc|↑|0.5833|±|0.0477|
| - security_studies_v0.2|0|none|5|acc|↑|0.6667|±|0.0309|
| - social_sciences_v0.2|N/A|none|5|acc|↑|0.5917|±|0.0088|
| - sociology_v0.2|0|none|5|acc|↑|0.6718|±|0.0337|
| - stem_v0.2|N/A|none|5|acc|↑|0.4709|±|0.0087|
|mmlu_tr_v0.2|N/A|none|0|acc|↑|0.5183|±|0.0041|
| - us_foreign_policy_v0.2|0|none|5|acc|↑|0.7475|±|0.0439|
| - virology_v0.2|0|none|5|acc|↑|0.4528|±|0.0396|
| - world_religions_v0.2|0|none|5|acc|↑|0.6726|±|0.0363|
|truthfulqa_v0.2|Yaml|none|0|acc|↑|0.5296|±|0.0158|
|winogrande_tr-v0.2|Yaml|none|10|acc|↑|0.5624|±|0.0139|

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---:|---|---|---:|---|---:|
| - humanities_v0.2|N/A|none|5|acc|↑|0.4612|±|0.0071|
| - other_v0.2|N/A|none|5|acc|↑|0.5773|±|0.0088|
| - social_sciences_v0.2|N/A|none|5|acc|↑|0.5917|±|0.0088|
| - stem_v0.2|N/A|none|5|acc|↑|0.4709|±|0.0087|
|mmlu_tr_v0.2|N/A|none|0|acc|↑|0.5183|±|0.0041|

Thank you for bringing this to my attention. I will re-evaluate and post an update here.

The results are correct; I'm not exactly sure why you are getting different results.
|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|--------------------------------------------|-------|----------------|-----:|-----------|---|-----:|---|-----:|
|arc_tr-v0.2 | 1|none | 25|acc |↑ |0.5179|Β± |0.0146|
| | |none | 25|acc_norm |↑ |0.5648|Β± |0.0145|
|gsm8k_tr-v0.2 |Yaml |flexible-extract| 5|exact_match|↑ |0.1443|Β± |0.0097|
| | |strict-match | 5|exact_match|↑ |0.6302|Β± |0.0133|
|hellaswag_tr-v0.2 |Yaml |none | 10|acc |↑ |0.4356|Β± |0.0053|
| | |none | 10|acc_norm |↑ |0.5650|Β± |0.0053|
| - abstract_algebra_v0.2 | 0|none | 5|acc |↑ |0.3500|Β± |0.0479|
| - anatomy_v0.2 | 0|none | 5|acc |↑ |0.6565|Β± |0.0416|
| - astronomy | 0|none | 0|acc |↑ |0.8158|Β± |0.0315|
| - business_ethics_v0.2 | 0|none | 5|acc |↑ |0.6465|Β± |0.0483|
| - clinical_knowledge_v0.2 | 0|none | 5|acc |↑ |0.6758|Β± |0.0293|
| - college_biology_v0.2 | 0|none | 5|acc |↑ |0.7254|Β± |0.0376|
| - college_chemistry_v0.2 | 0|none | 5|acc |↑ |0.5556|Β± |0.0502|
| - college_computer_science_v0.2 | 0|none | 5|acc |↑ |0.4444|Β± |0.0502|
| - college_mathematics_v0.2 | 0|none | 5|acc |↑ |0.4000|Β± |0.0492|
| - college_medicine_v0.2 | 0|none | 5|acc |↑ |0.6250|Β± |0.0375|
| - college_physics_v0.2 | 0|none | 5|acc |↑ |0.4752|Β± |0.0499|
| - computer_security_v0.2 | 0|none | 5|acc |↑ |0.6500|Β± |0.0479|
| - conceptual_physics_v0.2 | 0|none | 5|acc |↑ |0.5966|Β± |0.0322|
| - econometrics_v0.2 | 0|none | 5|acc |↑ |0.5351|Β± |0.0469|
| - electrical_engineering_v0.2 | 0|none | 5|acc |↑ |0.6042|Β± |0.0409|
| - elementary_mathematics_v0.2 | 0|none | 5|acc |↑ |0.5416|Β± |0.0258|
| - formal_logic_v0.2 | 0|none | 5|acc |↑ |0.4683|Β± |0.0446|
| - global_facts_v0.2 | 0|none | 5|acc |↑ |0.3980|Β± |0.0497|
| - high_school_biology_v0.2 | 0|none | 5|acc |↑ |0.8100|Β± |0.0227|
| - high_school_chemistry_v0.2 | 0|none | 5|acc |↑ |0.6244|Β± |0.0346|
| - high_school_computer_science_v0.2 | 0|none | 5|acc |↑ |0.7000|Β± |0.0461|
| - high_school_european_history_v0.2 | 0|none | 5|acc |↑ |0.7067|Β± |0.0373|
| - high_school_geography_v0.2 | 0|none | 5|acc |↑ |0.7310|Β± |0.0317|
| - high_school_government_and_politics_v0.2| 0|none | 5|acc |↑ |0.7112|Β± |0.0332|
| - high_school_macroeconomics_v0.2 | 0|none | 5|acc |↑ |0.6795|Β± |0.0237|
| - high_school_mathematics_v0.2 | 0|none | 5|acc |↑ |0.3704|Β± |0.0294|
| - high_school_microeconomics_v0.2 | 0|none | 5|acc |↑ |0.6245|Β± |0.0315|
| - high_school_physics_v0.2 | 0|none | 5|acc |↑ |0.4694|Β± |0.0413|
| - high_school_psychology_v0.2 | 0|none | 5|acc |↑ |0.7917|Β± |0.0176|
| - high_school_statistics_v0.2 | 0|none | 5|acc |↑ |0.4907|Β± |0.0341|
| - high_school_us_history_v0.2 | 0|none | 5|acc |↑ |0.7877|Β± |0.0307|
| - high_school_world_history_v0.2 | 0|none | 5|acc |↑ |0.7606|Β± |0.0293|
| - human_aging_v0.2 | 0|none | 5|acc |↑ |0.6887|Β± |0.0319|
| - human_sexuality_v0.2 | 0|none | 5|acc |↑ |0.6870|Β± |0.0434|
| - humanities_v0.2 |N/A |none | 5|acc |↑ |0.5448|Β± |0.0071|
| - international_law_v0.2 | 0|none | 5|acc |↑ |0.7769|Β± |0.0380|
| - jurisprudence_v0.2 | 0|none | 5|acc |↑ |0.7736|Β± |0.0408|
| - logical_fallacies_v0.2 | 0|none | 5|acc |↑ |0.6460|Β± |0.0378|
| - machine_learning_v0.2 | 0|none | 5|acc |↑ |0.4732|Β± |0.0474|
| - management_v0.2 | 0|none | 5|acc |↑ |0.7778|Β± |0.0420|
| - marketing_v0.2 | 0|none | 5|acc |↑ |0.8065|Β± |0.0269|
| - medical_genetics_v0.2 | 0|none | 5|acc |↑ |0.7263|Β± |0.0460|
| - miscellaneous_v0.2 | 0|none | 5|acc |↑ |0.7676|Β± |0.0153|
| - moral_disputes_v0.2 | 0|none | 5|acc |↑ |0.7078|Β± |0.0260|
| - moral_scenarios_v0.2 | 0|none | 5|acc |↑ |0.3372|Β± |0.0160|
| - nutrition_v0.2 | 0|none | 5|acc |↑ |0.6852|Β± |0.0266|
| - other_v0.2 |N/A |none | 5|acc |↑ |0.6642|Β± |0.0083|
| - philosophy_v0.2 | 0|none | 5|acc |↑ |0.6555|Β± |0.0275|
| - prehistory_v0.2 | 0|none | 5|acc |↑ |0.7167|Β± |0.0261|
| - professional_accounting_v0.2 | 0|none | 5|acc |↑ |0.3943|Β± |0.0293|
| - professional_law_v0.2 | 0|none | 5|acc |↑ |0.4294|Β± |0.0133|
| - professional_medicine_v0.2 | 0|none | 5|acc |↑ |0.6475|Β± |0.0296|
| - professional_psychology_v0.2 | 0|none | 5|acc |↑ |0.6212|Β± |0.0199|
| - public_relations_v0.2 | 0|none | 5|acc |↑ |0.6759|Β± |0.0452|
| - security_studies_v0.2 | 0|none | 5|acc |↑ |0.7308|Β± |0.0291|
| - social_sciences_v0.2 |N/A |none | 5|acc |↑ |0.6970|Β± |0.0083|
| - sociology_v0.2 | 0|none | 5|acc |↑ |0.7692|Β± |0.0302|
| - stem |N/A |none | 5|acc |↑ |0.5751|Β± |0.0085|
|mmlu_tr_v0.2 |N/A |none | 0|acc |↑ |0.6122|Β± |0.0040|
| - us_foreign_policy_v0.2 | 0|none | 5|acc |↑ |0.7879|Β± |0.0413|
| - virology_v0.2 | 0|none | 5|acc |↑ |0.4906|Β± |0.0398|
| - world_religions_v0.2 | 0|none | 5|acc |↑ |0.7440|Β± |0.0338|
|truthfulqa_v0.2 |Yaml |none | 0|acc |↑ |0.5584|Β± |0.0158|
|winogrande_tr-v0.2 |Yaml |none | 10|acc |↑ |0.6201|Β± |0.0136|

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---:|---|---|---:|---|---:|
| - humanities_v0.2|N/A|none|5|acc|↑|0.5448|±|0.0071|
| - other_v0.2|N/A|none|5|acc|↑|0.6642|±|0.0083|
| - social_sciences_v0.2|N/A|none|5|acc|↑|0.6970|±|0.0083|
| - stem|N/A|none|5|acc|↑|0.5751|±|0.0085|
|mmlu_tr_v0.2|N/A|none|0|acc|↑|0.6122|±|0.0040|

Execution command: lm_eval --model vllm --model_args "pretrained=google/gemma-2-9b-it,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.6,data_parallel_size=4,max_model_len=3048" --tasks "mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2" --output_path "/home/malhajar" --batch_size=auto
I am using tensor parallelism for the evaluation.

malhajar changed discussion status to closed

@malhajar Hello,

Thanks for re-evaluating. I think the difference is because I used --model hf instead of vllm, and that is probably what causes the discrepancy. I also realized that Hugging Face does not use an inference engine in their leaderboard either, so this should be taken into account.
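
For reference, the hf backend scores each choice with plain forward passes through transformers (no inference engine in between), while the vllm backend goes through vLLM's kernels and batching, so small numeric differences from dtype, batching, or kernel implementations are plausible. Below is a simplified sketch of that kind of forward-pass loglikelihood scoring, written against the public transformers API; it is not the harness's actual implementation, and the dtype/loading choices and the clean context/continuation token boundary are assumptions made for illustration.

```python
# Simplified sketch of scoring an answer continuation with direct forward passes,
# roughly what the "hf" backend does. Not the harness's code; dtype and loading
# options below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def loglikelihood(context: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `context`.
    Assumes the tokenizer splits cleanly at the context/continuation boundary."""
    n_ctx = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits.float()
    # Log-prob of each continuation token, predicted from the previous position.
    logprobs = torch.log_softmax(logits[0, n_ctx - 1:-1], dim=-1)
    cont_ids = full_ids[0, n_ctx:]
    idx = torch.arange(cont_ids.shape[0], device=cont_ids.device)
    return logprobs[idx, cont_ids].sum().item()

# Example: which choice does the model prefer?
choices = [" Paris", " Berlin"]
scores = [loglikelihood("The capital of France is", c) for c in choices]
print(scores, "-> predicted choice:", choices[scores.index(max(scores))])
```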
