# gpt2-dpo

This model is a fine-tuned version of mNLP-project/gpt2-finetuned, trained with Direct Preference Optimization (DPO) on an unspecified preference dataset. It achieves the following results on the evaluation set:
- Loss: 0.6350
- Rewards/chosen: 1.6222
- Rewards/rejected: 1.3204
- Rewards/accuracies: 0.6496
- Rewards/margins: 0.3018
- Logps/rejected: -780.0735
- Logps/chosen: -933.2262
- Logits/rejected: -34.5449
- Logits/chosen: -28.7838
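
For reference, the reward figures above follow the standard DPO convention (as implemented in TRL's `DPOTrainer`); this note is an explanatory sketch added for context, not information from the original card. The implicit reward of a completion $y$ for a prompt $x$ is

$$
r_\theta(x, y) = \beta \left[\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right],
$$

so Rewards/margins is the mean of Rewards/chosen minus Rewards/rejected (1.6222 − 1.3204 ≈ 0.3018 above), and Rewards/accuracies is the fraction of evaluation pairs for which the chosen completion's reward exceeds the rejected one's.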
## Model description
More information needed
## Intended uses & limitations
More information needed
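
As a minimal usage sketch (the repository id `mNLP-project/gpt2-dpo` is an assumption inferred from the model name and base checkpoint; adjust it if the actual id differs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; not confirmed by this card.
model_id = "mNLP-project/gpt2-dpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Direct Preference Optimization is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```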
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (a hedged `TrainingArguments` sketch follows the list):
- learning_rate: 1e-06
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 10
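
A hedged reconstruction of these settings as Hugging Face `TrainingArguments` (the output directory is an assumption, and DPO-specific options such as beta are not documented in this card):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-dpo",            # assumed output directory
    learning_rate=1e-6,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,    # 8 x 4 = 32 effective train batch size
    num_train_epochs=10,
    lr_scheduler_type="cosine",
    warmup_ratio=0.2,
    seed=42,
)
```

In a typical TRL setup these arguments would be passed to `DPOTrainer` together with the policy model (initialised from mNLP-project/gpt2-finetuned), a frozen reference model, and the preference dataset; none of those specifics are given in the card.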
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6286 | 0.9993 | 668  | 0.6350 | 1.6222 | 1.3204 | 0.6496 | 0.3018 | -780.0735 | -933.2262 | -34.5449 | -28.7838 |
| 0.6387 | 2.0    | 1337 | 0.6662 | 1.8546 | 1.5416 | 0.6302 | 0.3130 | -777.8622 | -930.9024 | -34.5110 | -28.7424 |
| 0.5643 | 2.9993 | 2005 | 0.6635 | 2.0534 | 1.6918 | 0.6396 | 0.3616 | -776.3599 | -928.9147 | -34.5066 | -28.7168 |
| 0.4487 | 4.0    | 2674 | 0.6677 | 2.2748 | 1.8809 | 0.6451 | 0.3940 | -774.4694 | -926.7002 | -34.1409 | -28.2530 |
| 0.3831 | 4.9993 | 3342 | 0.6783 | 2.4765 | 2.0527 | 0.6418 | 0.4238 | -772.7513 | -924.6838 | -34.0051 | -28.0668 |
| 0.352  | 6.0    | 4011 | 0.6782 | 2.4441 | 2.0097 | 0.6440 | 0.4344 | -773.1808 | -925.0074 | -34.0868 | -28.1418 |
| 0.3189 | 6.9993 | 4679 | 0.6840 | 2.2310 | 1.8303 | 0.6343 | 0.4008 | -774.9752 | -927.1384 | -33.9525 | -27.9466 |
| 0.3006 | 8.0    | 5348 | 0.6882 | 2.4339 | 1.9918 | 0.6388 | 0.4422 | -773.3604 | -925.1093 | -33.7716 | -27.7551 |
| 0.3152 | 8.9993 | 6016 | 0.6891 | 2.4920 | 2.0457 | 0.6407 | 0.4462 | -772.8206 | -924.5289 | -33.6753 | -27.6463 |
| 0.2752 | 9.9925 | 6680 | 0.6892 | 2.4562 | 2.0151 | 0.6410 | 0.4411 | -773.1274 | -924.8871 | -33.6818 | -27.6538 |
### Framework versions
- Transformers 4.40.2
- Pytorch 2.1.0+cu118
- Datasets 2.19.1
- Tokenizers 0.19.1