gpt2-dpo-with-cosine-lr-scheduler

This model is a fine-tuned version of mNLP-project/gpt2-finetuned, trained with DPO on an unspecified preference dataset. It achieves the following results on the evaluation set:

  • Loss: 1.1168
  • Rewards/chosen: 3.8849
  • Rewards/rejected: 3.2031
  • Rewards/accuracies: 0.5892
  • Rewards/margins: 0.6818
  • Logps/rejected: -761.2470
  • Logps/chosen: -910.5992
  • Logits/rejected: -36.5651
  • Logits/chosen: -30.3810
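
For reference, the margin above is the gap between the chosen and rejected rewards. In DPO each reward is the policy's log-probability ratio against the reference model scaled by β (the β used for this run is not stated in this card); as a worked check against the numbers above:

```latex
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\text{Rewards/margins} = r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}})
\approx 3.8849 - 3.2031 = 0.6818
```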

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 10
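
As a sketch of how the hyperparameters above could be wired together, assuming training used TRL's DPOTrainer (API as of TRL ~0.8, contemporaneous with the Transformers version listed below); the dataset, β, and sequence-length settings are not stated in this card and are placeholders here:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "mNLP-project/gpt2-finetuned"  # base checkpoint named in this card
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)   # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token                 # GPT-2 has no pad token by default

# Toy placeholder data; the real preference dataset is not described in this card.
preference_dataset = Dataset.from_dict({
    "prompt": ["Example prompt"],
    "chosen": ["Preferred completion"],
    "rejected": ["Dispreferred completion"],
})

args = TrainingArguments(
    output_dir="gpt2-dpo-with-cosine-lr-scheduler",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective train batch size 16
    num_train_epochs=10,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,                        # assumption: β is not reported in the card
    args=args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```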

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.9846 | 1.0 | 1337 | 1.1168 | 3.8849 | 3.2031 | 0.5892 | 0.6818 | -761.2470 | -910.5992 | -36.5651 | -30.3810 |
| 0.6025 | 2.0 | 2674 | 1.1405 | 5.0060 | 4.0992 | 0.6175 | 0.9068 | -752.2864 | -899.3887 | -35.0528 | -28.9839 |
| 0.2464 | 3.0 | 4011 | 1.1202 | 4.6754 | 3.6835 | 0.6160 | 0.9919 | -756.4427 | -902.6943 | -39.6513 | -33.3219 |
| 0.1182 | 4.0 | 5348 | 1.3054 | 7.3114 | 5.8367 | 0.6131 | 1.4747 | -734.9108 | -876.3349 | -35.1974 | -28.6005 |
| 0.0669 | 5.0 | 6685 | 1.3846 | 6.5378 | 5.0738 | 0.6093 | 1.4640 | -742.5399 | -884.0710 | -39.0355 | -31.8814 |
| 0.0226 | 6.0 | 8022 | 1.4662 | 6.2901 | 4.6812 | 0.6052 | 1.6089 | -746.4659 | -886.5475 | -40.3811 | -32.9593 |
| 0.0128 | 7.0 | 9359 | 1.5557 | 5.8081 | 4.1554 | 0.6108 | 1.6527 | -751.7241 | -891.3676 | -39.1744 | -31.2704 |
| 0.019 | 8.0 | 10696 | 1.6676 | 5.5428 | 3.8458 | 0.6011 | 1.6970 | -754.8205 | -894.0207 | -40.5161 | -32.4700 |
| 0.0101 | 9.0 | 12033 | 1.7100 | 5.5531 | 3.8215 | 0.6022 | 1.7315 | -755.0627 | -893.9178 | -40.7171 | -32.5929 |
| 0.0053 | 10.0 | 13370 | 1.7177 | 5.4221 | 3.7030 | 0.6000 | 1.7191 | -756.2481 | -895.2274 | -40.8064 | -32.6689 |
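
The cosine schedule named in the card title can be reproduced with transformers' get_cosine_schedule_with_warmup. A minimal sketch, assuming the 13370 total optimizer steps shown in the table (10 epochs × 1337 steps per epoch) and the 0.1 warmup ratio listed above:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Dummy parameter so the optimizer/scheduler pair can be stepped for illustration.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-5, betas=(0.9, 0.999), eps=1e-8)

total_steps = 13370                    # 10 epochs x 1337 steps, from the table above
warmup_steps = int(0.1 * total_steps)  # warmup_ratio 0.1 -> 1337 warmup steps

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# LR rises linearly to 1e-5 during warmup, then decays to 0 along a cosine curve.
lrs = []
for _ in range(total_steps):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
```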

Framework versions

  • Transformers 4.40.2
  • PyTorch 2.1.0+cu118
  • Datasets 2.19.1
  • Tokenizers 0.19.1
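
A minimal usage sketch for loading this checkpoint for generation with the Transformers version listed above (the repo id is taken from this card's title; the prompt and sampling settings are purely illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "mNLP-project/gpt2-dpo-with-cosine-lr-scheduler"  # from the card title
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```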