# chat_1000STEPS_1e7_03beta_DPO
This model is a DPO fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set (a sketch of how these reward metrics are derived follows the list):
- Loss: 0.6902
- Rewards/chosen: -0.0000
- Rewards/rejected: -0.0069
- Rewards/accuracies: 0.4681
- Rewards/margins: 0.0069
- Logps/rejected: -18.8144
- Logps/chosen: -16.7447
- Logits/rejected: -0.5973
- Logits/chosen: -0.5972
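
The reward figures above are DPO's implicit rewards: the β-scaled log-probability ratio between the fine-tuned policy and the frozen reference model, averaged over the evaluation set. Below is a minimal sketch of how they relate to the loss, assuming β = 0.3 (inferred from the "03beta" in the model name); the function and argument names are illustrative, not taken from the training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.3):
    """DPO loss and the implicit rewards reported in this card.

    Each argument is the log-probability of a chosen/rejected completion
    under the policy being trained or the frozen reference model.
    """
    # Rewards/chosen and Rewards/rejected: beta-scaled log-ratios.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Rewards/margins is the gap; Rewards/accuracies is the fraction of
    # evaluation pairs for which this margin is positive.
    margin = chosen_reward - rejected_reward
    loss = -F.logsigmoid(margin)
    return loss, chosen_reward, rejected_reward
```

As a rough sanity check, `-F.logsigmoid(torch.tensor(0.0069))` ≈ 0.6897, which lines up with the reported evaluation loss of 0.6902 (the small gap comes from averaging the loss per example rather than computing it from averaged rewards).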
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows this list):
- learning_rate: 1e-07
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 1000
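
The card does not name the training stack, but these settings map directly onto a transformers TrainingArguments object passed to trl's DPOTrainer. Here is a minimal configuration sketch under that assumption (β = 0.3 from the model name; the preference dataset is undocumented, so the dataset variables are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(base)

args = TrainingArguments(
    output_dir="chat_1000STEPS_1e7_03beta_DPO",
    learning_rate=1e-7,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,  # 4 x 2 = total train batch size of 8
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=1000,
    seed=42,
    # Adam betas (0.9, 0.999) and epsilon 1e-8 are the library defaults.
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=args,
    beta=0.3,                     # the "03beta" in the model name
    train_dataset=train_dataset,  # placeholder: dataset not documented
    eval_dataset=eval_dataset,    # placeholder
    tokenizer=tokenizer,
)
trainer.train()
```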
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6935 | 0.2 | 100 | 0.6925 | -0.0035 | -0.0055 | 0.4286 | 0.0019 | -18.8094 | -16.7564 | -0.5969 | -0.5967 |
| 0.6934 | 0.39 | 200 | 0.6911 | 0.0022 | -0.0027 | 0.4615 | 0.0049 | -18.8003 | -16.7374 | -0.5979 | -0.5977 |
| 0.6882 | 0.59 | 300 | 0.6929 | -0.0047 | -0.0060 | 0.4330 | 0.0013 | -18.8112 | -16.7601 | -0.5973 | -0.5972 |
| 0.6896 | 0.78 | 400 | 0.6907 | -0.0013 | -0.0070 | 0.4615 | 0.0057 | -18.8147 | -16.7490 | -0.5982 | -0.5981 |
| 0.6877 | 0.98 | 500 | 0.6904 | 0.0012 | -0.0051 | 0.4923 | 0.0063 | -18.8082 | -16.7405 | -0.5972 | -0.5971 |
| 0.6829 | 1.17 | 600 | 0.6903 | -0.0020 | -0.0085 | 0.4703 | 0.0066 | -18.8198 | -16.7511 | -0.5976 | -0.5975 |
| 0.6832 | 1.37 | 700 | 0.6904 | -0.0032 | -0.0097 | 0.4593 | 0.0064 | -18.8236 | -16.7554 | -0.5971 | -0.5970 |
| 0.6802 | 1.56 | 800 | 0.6889 | -0.0010 | -0.0105 | 0.4923 | 0.0096 | -18.8263 | -16.7478 | -0.5979 | -0.5978 |
| 0.6826 | 1.76 | 900 | 0.6897 | -0.0009 | -0.0088 | 0.4769 | 0.0079 | -18.8206 | -16.7475 | -0.5972 | -0.5971 |
| 0.6761 | 1.95 | 1000 | 0.6902 | -0.0000 | -0.0069 | 0.4681 | 0.0069 | -18.8144 | -16.7447 | -0.5973 | -0.5972 |
### Framework versions
- Transformers 4.37.2
- Pytorch 2.0.0+cu117
- Datasets 2.17.0
- Tokenizers 0.15.2
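
For completeness, here is a minimal inference sketch against the published checkpoint. The `[INST]` prompt template is the Llama-2-chat convention inherited from the base model; this card does not document a prompt format, so treat that as an assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tsavage68/chat_1000STEPS_1e7_03beta_DPO"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

prompt = "[INST] What does DPO fine-tuning change about a chat model? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```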