chat_1000STEPS_1e7rate_01beta_DPO

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.8129	0.2	100	0.7825	-1.1957	-1.1981	0.3934	0.0024	-30.7728	-28.7020	-0.0569	-0.0566
0.8136	0.39	200	0.8828	-0.9245	-0.8916	0.4044	-0.0329	-27.7071	-25.9900	0.2762	0.2769
0.7535	0.59	300	0.8597	-1.3930	-1.4515	0.4000	0.0585	-33.3058	-30.6746	1.0803	1.0813
0.9558	0.78	400	0.8896	-0.8319	-0.7033	0.3604	-0.1285	-25.8247	-25.0635	0.4421	0.4425
0.7839	0.98	500	0.7987	-0.8948	-1.0616	0.4264	0.1667	-29.4069	-25.6928	0.6877	0.6886
0.2401	1.17	600	0.9002	-2.8266	-3.2238	0.4725	0.3972	-51.0296	-45.0107	-0.0174	-0.0164
0.2852	1.37	700	0.9362	-2.6553	-3.0787	0.4769	0.4234	-49.5784	-43.2978	-0.1079	-0.1069
0.2151	1.56	800	0.9663	-2.5826	-3.1268	0.5011	0.5443	-50.0594	-42.5702	-0.1730	-0.1719
0.2376	1.76	900	0.9701	-2.8346	-3.3672	0.4945	0.5326	-52.4633	-45.0905	-0.2881	-0.2870
0.2943	1.95	1000	0.9688	-2.8329	-3.3687	0.4989	0.5358	-52.4786	-45.0740	-0.2885	-0.2875