zephyr-dpop-qlora-uf-ours-5e-6

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the generation/UF dataset. It achieves the following results on the evaluation set:

Loss: 5.1264
Positive Losses: 43.1884
Dpo Losses: 0.6101
Rewards/chosen: -0.3903
Rewards/rejected: -0.7274
Rewards/accuracies: 0.6670
Rewards/margins: 0.3370
Rewards/margins Max: 1.4167
Rewards/margins Min: -0.8378
Rewards/margins Std: 0.7707
Logps/rejected: -331.3143
Logps/chosen: -323.6263
Logits/rejected: -2.4808
Logits/chosen: -2.5277
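
The reward figures above can be checked against each other. The minimal sketch below assumes the usual DPO-style convention that the margin is the chosen reward minus the rejected reward; the card does not state this definition, so treat it as an assumption. All values are copied from the list above.

```python
# Values copied from the evaluation results above.
rewards_chosen = -0.3903
rewards_rejected = -0.7274
reported_margin = 0.3370

# Assumption (not stated in the card): margin = chosen reward - rejected reward.
computed_margin = rewards_chosen - rewards_rejected

# The reported numbers are rounded to four decimals, so allow a small tolerance.
assert abs(computed_margin - reported_margin) < 1e-3
print(f"computed margin: {computed_margin:.4f}, reported margin: {reported_margin:.4f}")
```

Under that assumption the computed and reported margins agree to within rounding.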

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 4
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 2
gradient_accumulation_steps: 2
total_train_batch_size: 16
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
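
The total batch sizes listed above follow from the per-device sizes, the device count, and gradient accumulation. The sketch below reproduces that arithmetic and also illustrates a cosine schedule with linear warmup. The schedule function is a plain-Python approximation, not the trainer's own code, and `TOTAL_STEPS` is a hypothetical estimate extrapolated from the results table (step 1000 at epoch 2.82), not a value reported in this card.

```python
import math


def effective_batch_size(per_device: int, num_devices: int, grad_accum: int = 1) -> int:
    """Number of samples contributing to one optimizer (or evaluation) step."""
    return per_device * num_devices * grad_accum


def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-6,
              warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative only)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Training: 4 per device x 2 devices x 2 accumulation steps.
train_total = effective_batch_size(4, 2, 2)
# Evaluation: 8 per device x 2 devices (gradient accumulation does not apply).
eval_total = effective_batch_size(8, 2)
print(train_total, eval_total)

# Hypothetical total step count, estimated as 1000 / 2.82 * 3 and rounded.
TOTAL_STEPS = 1064
for step in (0, 107, 532, 1064):
    print(step, f"{cosine_lr(step, TOTAL_STEPS):.2e}")
```

Both totals come out to 16, matching total_train_batch_size and total_eval_batch_size above.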

Training results

Training Loss	Epoch	Step	Validation Loss	Positive Losses	Dpo Losses	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Rewards/margins Max	Rewards/margins Min	Rewards/margins Std	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6302	0.28	100	0.8170	1.2658	0.6732	0.0877	0.0421	0.5920	0.0456	0.2717	-0.1472	0.1389	-254.3717	-275.8216	-2.6655	-2.7015
0.5709	0.56	200	2.1527	14.1341	0.6518	-0.0932	-0.2071	0.6360	0.1139	0.6302	-0.3647	0.3297	-279.2877	-293.9101	-2.6591	-2.6989
0.4758	0.85	300	2.2508	15.0103	0.6396	-0.0829	-0.2324	0.6590	0.1495	0.7147	-0.4138	0.3813	-281.8231	-292.8875	-2.6866	-2.7294
0.4857	1.13	400	2.8413	20.4422	0.6295	-0.1464	-0.3473	0.6540	0.2010	0.9605	-0.5524	0.5026	-293.3139	-299.2286	-2.5810	-2.6240
0.6015	1.41	500	2.4297	16.2472	0.6215	-0.0798	-0.3011	0.6660	0.2213	0.9834	-0.5416	0.5125	-288.6871	-292.5703	-2.5803	-2.6246
0.4849	1.69	600	3.8077	30.0769	0.6153	-0.2435	-0.5155	0.6630	0.2721	1.1651	-0.6779	0.6337	-310.1338	-308.9421	-2.5659	-2.6120
0.4012	1.97	700	4.4359	36.7814	0.6160	-0.3161	-0.6003	0.6660	0.2841	1.2285	-0.7320	0.6759	-318.6039	-316.2043	-2.5208	-2.5672
0.3245	2.25	800	4.9873	41.8073	0.6123	-0.3752	-0.6988	0.6660	0.3236	1.3768	-0.8214	0.7506	-328.4567	-322.1156	-2.4952	-2.5421
0.3018	2.54	900	5.0342	42.1224	0.6084	-0.3810	-0.7194	0.6680	0.3383	1.4141	-0.8336	0.7645	-330.5147	-322.6951	-2.4804	-2.5276
0.4364	2.82	1000	5.0975	42.8746	0.6098	-0.3872	-0.7242	0.6680	0.3370	1.4157	-0.8369	0.7695	-331.0000	-323.3101	-2.4816	-2.5285
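
In the table above, the validation loss and positive losses rise over training while the DPO losses fall and the reward accuracy improves, so the choice of "best" checkpoint depends on the metric. The sketch below shows one way to compare checkpoints, using a subset of columns copied from the table. The selection criteria are illustrative and are not described in this card.

```python
# (step, validation_loss, positive_losses, dpo_losses, rewards_accuracies),
# copied from the training results table above.
rows = [
    (100, 0.8170, 1.2658, 0.6732, 0.5920),
    (200, 2.1527, 14.1341, 0.6518, 0.6360),
    (300, 2.2508, 15.0103, 0.6396, 0.6590),
    (400, 2.8413, 20.4422, 0.6295, 0.6540),
    (500, 2.4297, 16.2472, 0.6215, 0.6660),
    (600, 3.8077, 30.0769, 0.6153, 0.6630),
    (700, 4.4359, 36.7814, 0.6160, 0.6660),
    (800, 4.9873, 41.8073, 0.6123, 0.6660),
    (900, 5.0342, 42.1224, 0.6084, 0.6680),
    (1000, 5.0975, 42.8746, 0.6098, 0.6680),
]

# Pick a checkpoint under three different criteria.
# On ties, min() and max() return the earliest matching row.
best_by_val_loss = min(rows, key=lambda r: r[1])
best_by_dpo_loss = min(rows, key=lambda r: r[3])
best_by_accuracy = max(rows, key=lambda r: r[4])

print("lowest validation loss at step", best_by_val_loss[0])
print("lowest DPO loss at step", best_by_dpo_loss[0])
print("highest reward accuracy at step", best_by_accuracy[0])
```

The lowest validation loss is at step 100, while the lowest DPO loss and the highest reward accuracy are both at step 900.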

Framework versions

PEFT 0.7.1
Transformers 4.39.0.dev0
PyTorch 2.1.2+cu121
Datasets 2.14.6
Tokenizers 0.15.2
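
Given the framework versions above, one plausible way to load the model is through PEFT. This is a minimal sketch, not a documented usage recipe. It assumes the repository just1nseo/zephyr-dpop-qlora-uf-ours-5e-6 hosts a PEFT adapter on top of alignment-handbook/zephyr-7b-sft-full (suggested by the "qlora" name and the PEFT dependency, but not confirmed here), that the base model's tokenizer is the right one to use, and that `accelerate` is installed for `device_map="auto"`. The helper name `load_model` is hypothetical.

```python
REPO_ID = "just1nseo/zephyr-dpop-qlora-uf-ours-5e-6"
BASE_ID = "alignment-handbook/zephyr-7b-sft-full"


def load_model(repo_id: str = REPO_ID, base_id: str = BASE_ID):
    """Load the adapter with its base model, plus a tokenizer.

    Assumption: repo_id contains a PEFT adapter config that points at the base model.
    """
    # Heavy imports are kept inside the function so that defining it is cheap.
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # requires the accelerate package
    )
    # Assumption: the tokenizer is taken from the base model repository.
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    return model, tokenizer
```

Calling `load_model()` downloads the weights, so it needs network access and enough memory for a 7B-parameter model.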

just1nseo/zephyr-dpop-qlora-uf-ours-5e-6
