# Model Card for Kurtis-E1.1-Qwen2.5-3B-Instruct
Kurtis E1.1, fine-tuned with [Flower](https://flower.ai/).
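Since this is a fine-tune of Qwen2.5-3B-Instruct, it can be loaded through the standard `transformers` chat workflow. A minimal sketch (the model ID is taken from the eval commands below; the prompt and generation settings are illustrative, not from the original card):

```python
# Minimal usage sketch; assumes transformers, torch, and accelerate are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build the prompt with the model's own chat template.
messages = [{"role": "user", "content": "How can I manage stress before an exam?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```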
## Eval Results
Evaluation tasks were performed with the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) on an NVIDIA A40 GPU.
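To reproduce the runs below, the harness needs to be installed first, e.g. from PyPI:

```bash
pip install lm_eval
```

Unless `--num_fewshot` is given, these tasks run zero-shot, which matches the n-shot column in the tables below.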
### hellaswag
```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks hellaswag --device cuda:0 --batch_size 8
```
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---:|---|---:|---|:-:|---:|:-:|---:|
| hellaswag | 1 | none | 0 | acc | ↑ | 0.5555 | ± | 0.0050 |
|  |  | none | 0 | acc_norm | ↑ | 0.7412 | ± | 0.0044 |
### arc_easy
```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks arc_easy --device cuda:0 --batch_size 8
```
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---:|---|---:|---|:-:|---:|:-:|---:|
| arc_easy | 1 | none | 0 | acc | ↑ | 0.7710 | ± | 0.0086 |
|  |  | none | 0 | acc_norm | ↑ | 0.6789 | ± | 0.0096 |
### arc_challenge
```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks arc_challenge --device cuda:0 --batch_size 8
```
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---:|---|---:|---|:-:|---:|:-:|---:|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.4360 | ± | 0.0145 |
|  |  | none | 0 | acc_norm | ↑ | 0.4480 | ± | 0.0145 |
### mmlu
```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks mmlu --device cuda:0 --batch_size 8
```
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---:|---|---:|---|:-:|---:|:-:|---:|
| mmlu | 2 | none |  | acc | ↑ | 0.6522 | ± | 0.0038 |
| - humanities | 2 | none |  | acc | ↑ | 0.5734 | ± | 0.0066 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.4603 | ± | 0.0446 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.7939 | ± | 0.0316 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.8333 | ± | 0.0262 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.8397 | ± | 0.0239 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.7769 | ± | 0.0380 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.7963 | ± | 0.0389 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.7975 | ± | 0.0316 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.6850 | ± | 0.0250 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2905 | ± | 0.0152 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.7106 | ± | 0.0258 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.7438 | ± | 0.0243 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.4759 | ± | 0.0128 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.8246 | ± | 0.0292 |
| - other | 2 | none |  | acc | ↑ | 0.7087 | ± | 0.0079 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.7300 | ± | 0.0446 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.7321 | ± | 0.0273 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.6705 | ± | 0.0358 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.3900 | ± | 0.0490 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.7130 | ± | 0.0304 |
| - management | 1 | none | 0 | acc | ↑ | 0.7961 | ± | 0.0399 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.8803 | ± | 0.0213 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.7600 | ± | 0.0429 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.7957 | ± | 0.0144 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.7353 | ± | 0.0253 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.5426 | ± | 0.0297 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.6434 | ± | 0.0291 |
| - virology | 1 | none | 0 | acc | ↑ | 0.4880 | ± | 0.0389 |
| - social sciences | 2 | none |  | acc | ↑ | 0.7618 | ± | 0.0076 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.5439 | ± | 0.0469 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.7677 | ± | 0.0301 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.8860 | ± | 0.0229 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.6949 | ± | 0.0233 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.7773 | ± | 0.0270 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.8477 | ± | 0.0154 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.7786 | ± | 0.0364 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.7075 | ± | 0.0184 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.6818 | ± | 0.0446 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.7224 | ± | 0.0287 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.8458 | ± | 0.0255 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.8400 | ± | 0.0368 |
| - stem | 2 | none |  | acc | ↑ | 0.6070 | ± | 0.0085 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.4700 | ± | 0.0502 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.6667 | ± | 0.0407 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.6776 | ± | 0.0380 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.7222 | ± | 0.0375 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.5000 | ± | 0.0503 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.6000 | ± | 0.0492 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.3400 | ± | 0.0476 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.4902 | ± | 0.0497 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.7000 | ± | 0.0461 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.6468 | ± | 0.0312 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.6690 | ± | 0.0392 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.5979 | ± | 0.0253 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.8129 | ± | 0.0222 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.5813 | ± | 0.0347 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.7800 | ± | 0.0416 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.5037 | ± | 0.0305 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.4437 | ± | 0.0406 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.5972 | ± | 0.0334 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.4554 | ± | 0.0473 |
| Groups | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---:|---|---:|---|:-:|---:|:-:|---:|
| mmlu | 2 | none |  | acc | ↑ | 0.6522 | ± | 0.0038 |
| - humanities | 2 | none |  | acc | ↑ | 0.5734 | ± | 0.0066 |
| - other | 2 | none |  | acc | ↑ | 0.7087 | ± | 0.0079 |
| - social sciences | 2 | none |  | acc | ↑ | 0.7618 | ± | 0.0076 |
| - stem | 2 | none |  | acc | ↑ | 0.6070 | ± | 0.0085 |
### mmlu (5-shot)
```bash
lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct --tasks mmlu --device cuda:0 --batch_size 8 --num_fewshot 5
```
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---:|---|---:|---|:-:|---:|:-:|---:|
| mmlu | 2 | none |  | acc | ↑ | 0.6629 | ± | 0.0038 |
| - humanities | 2 | none |  | acc | ↑ | 0.5862 | ± | 0.0067 |
| - formal_logic | 1 | none | 5 | acc | ↑ | 0.4683 | ± | 0.0446 |
| - high_school_european_history | 1 | none | 5 | acc | ↑ | 0.7818 | ± | 0.0323 |
| - high_school_us_history | 1 | none | 5 | acc | ↑ | 0.8284 | ± | 0.0265 |
| - high_school_world_history | 1 | none | 5 | acc | ↑ | 0.8692 | ± | 0.0219 |
| - international_law | 1 | none | 5 | acc | ↑ | 0.7769 | ± | 0.0380 |
| - jurisprudence | 1 | none | 5 | acc | ↑ | 0.7963 | ± | 0.0389 |
| - logical_fallacies | 1 | none | 5 | acc | ↑ | 0.8098 | ± | 0.0308 |
| - moral_disputes | 1 | none | 5 | acc | ↑ | 0.7110 | ± | 0.0244 |
| - moral_scenarios | 1 | none | 5 | acc | ↑ | 0.3464 | ± | 0.0159 |
| - philosophy | 1 | none | 5 | acc | ↑ | 0.7042 | ± | 0.0259 |
| - prehistory | 1 | none | 5 | acc | ↑ | 0.7284 | ± | 0.0247 |
| - professional_law | 1 | none | 5 | acc | ↑ | 0.4759 | ± | 0.0128 |
| - world_religions | 1 | none | 5 | acc | ↑ | 0.8304 | ± | 0.0288 |
| - other | 2 | none |  | acc | ↑ | 0.7171 | ± | 0.0078 |
| - business_ethics | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0441 |
| - clinical_knowledge | 1 | none | 5 | acc | ↑ | 0.7321 | ± | 0.0273 |
| - college_medicine | 1 | none | 5 | acc | ↑ | 0.6647 | ± | 0.0360 |
| - global_facts | 1 | none | 5 | acc | ↑ | 0.4100 | ± | 0.0494 |
| - human_aging | 1 | none | 5 | acc | ↑ | 0.7220 | ± | 0.0301 |
| - management | 1 | none | 5 | acc | ↑ | 0.7864 | ± | 0.0406 |
| - marketing | 1 | none | 5 | acc | ↑ | 0.8889 | ± | 0.0206 |
| - medical_genetics | 1 | none | 5 | acc | ↑ | 0.7900 | ± | 0.0409 |
| - miscellaneous | 1 | none | 5 | acc | ↑ | 0.7957 | ± | 0.0144 |
| - nutrition | 1 | none | 5 | acc | ↑ | 0.7680 | ± | 0.0242 |
| - professional_accounting | 1 | none | 5 | acc | ↑ | 0.5532 | ± | 0.0297 |
| - professional_medicine | 1 | none | 5 | acc | ↑ | 0.6471 | ± | 0.0290 |
| - virology | 1 | none | 5 | acc | ↑ | 0.5120 | ± | 0.0389 |
| - social sciences | 2 | none |  | acc | ↑ | 0.7735 | ± | 0.0075 |
| - econometrics | 1 | none | 5 | acc | ↑ | 0.5877 | ± | 0.0463 |
| - high_school_geography | 1 | none | 5 | acc | ↑ | 0.7828 | ± | 0.0294 |
| - high_school_government_and_politics | 1 | none | 5 | acc | ↑ | 0.8756 | ± | 0.0238 |
| - high_school_macroeconomics | 1 | none | 5 | acc | ↑ | 0.7051 | ± | 0.0231 |
| - high_school_microeconomics | 1 | none | 5 | acc | ↑ | 0.7773 | ± | 0.0270 |
| - high_school_psychology | 1 | none | 5 | acc | ↑ | 0.8550 | ± | 0.0151 |
| - human_sexuality | 1 | none | 5 | acc | ↑ | 0.8092 | ± | 0.0345 |
| - professional_psychology | 1 | none | 5 | acc | ↑ | 0.7288 | ± | 0.0180 |
| - public_relations | 1 | none | 5 | acc | ↑ | 0.6909 | ± | 0.0443 |
| - security_studies | 1 | none | 5 | acc | ↑ | 0.7551 | ± | 0.0275 |
| - sociology | 1 | none | 5 | acc | ↑ | 0.8308 | ± | 0.0265 |
| - us_foreign_policy | 1 | none | 5 | acc | ↑ | 0.8300 | ± | 0.0378 |
| - stem | 2 | none |  | acc | ↑ | 0.6159 | ± | 0.0084 |
| - abstract_algebra | 1 | none | 5 | acc | ↑ | 0.5000 | ± | 0.0503 |
| - anatomy | 1 | none | 5 | acc | ↑ | 0.6222 | ± | 0.0419 |
| - astronomy | 1 | none | 5 | acc | ↑ | 0.7500 | ± | 0.0352 |
| - college_biology | 1 | none | 5 | acc | ↑ | 0.7083 | ± | 0.0380 |
| - college_chemistry | 1 | none | 5 | acc | ↑ | 0.4700 | ± | 0.0502 |
| - college_computer_science | 1 | none | 5 | acc | ↑ | 0.6200 | ± | 0.0488 |
| - college_mathematics | 1 | none | 5 | acc | ↑ | 0.4000 | ± | 0.0492 |
| - college_physics | 1 | none | 5 | acc | ↑ | 0.4902 | ± | 0.0497 |
| - computer_security | 1 | none | 5 | acc | ↑ | 0.8200 | ± | 0.0386 |
| - conceptual_physics | 1 | none | 5 | acc | ↑ | 0.6383 | ± | 0.0314 |
| - electrical_engineering | 1 | none | 5 | acc | ↑ | 0.6483 | ± | 0.0398 |
| - elementary_mathematics | 1 | none | 5 | acc | ↑ | 0.5820 | ± | 0.0254 |
| - high_school_biology | 1 | none | 5 | acc | ↑ | 0.8161 | ± | 0.0220 |
| - high_school_chemistry | 1 | none | 5 | acc | ↑ | 0.6059 | ± | 0.0344 |
| - high_school_computer_science | 1 | none | 5 | acc | ↑ | 0.7500 | ± | 0.0435 |
| - high_school_mathematics | 1 | none | 5 | acc | ↑ | 0.4926 | ± | 0.0305 |
| - high_school_physics | 1 | none | 5 | acc | ↑ | 0.4702 | ± | 0.0408 |
| - high_school_statistics | 1 | none | 5 | acc | ↑ | 0.6343 | ± | 0.0328 |
| - machine_learning | 1 | none | 5 | acc | ↑ | 0.4911 | ± | 0.0475 |
| Groups | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---:|---|---:|---|:-:|---:|:-:|---:|
| mmlu | 2 | none |  | acc | ↑ | 0.6629 | ± | 0.0038 |
| - humanities | 2 | none |  | acc | ↑ | 0.5862 | ± | 0.0067 |
| - other | 2 | none |  | acc | ↑ | 0.7171 | ± | 0.0078 |
| - social sciences | 2 | none |  | acc | ↑ | 0.7735 | ± | 0.0075 |
| - stem | 2 | none |  | acc | ↑ | 0.6159 | ± | 0.0084 |