llama-7b-SFT-qlora-eli5-wiki_DPO_ds_RM_contrast_1024_r_64_alpha_16

This model is a fine-tuned version of dhmeltzer/llama-7b-SFT_eli5_wiki65k_1024_r_64_alpha_16_merged on an unknown dataset. Its evaluation-set results are reported per checkpoint in the table at the end of this card.

Model description

More information needed

The training hyperparameters are not listed in this card. The following results were logged during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6867	0.1	19	0.6390	0.0633	-0.1318	0.6451	0.1951	-197.8286	-205.5991	0.7774	0.8133
0.6727	0.21	38	0.6384	0.0354	-0.2285	0.6529	0.2639	-198.3123	-205.7386	0.8054	0.8432
0.6577	0.31	57	0.6391	-0.0114	-0.2258	0.6406	0.2145	-198.2988	-205.9725	0.7954	0.8346
0.6609	0.42	76	0.6344	-0.3737	-0.6175	0.6417	0.2438	-200.2571	-207.7841	0.7818	0.8194
0.6536	0.52	95	0.6285	-0.1130	-0.3816	0.6652	0.2687	-199.0778	-206.4805	0.7958	0.8350
0.654	0.62	114	0.6342	0.0007	-0.2311	0.6484	0.2318	-198.3250	-205.9122	0.7917	0.8303
0.6435	0.73	133	0.6258	0.0462	-0.2234	0.6562	0.2696	-198.2865	-205.6845	0.7949	0.8332
0.6508	0.83	152	0.6234	0.0858	-0.1898	0.6574	0.2756	-198.1188	-205.4868	0.7931	0.8315
0.6361	0.94	171	0.6269	0.1007	-0.1655	0.6618	0.2662	-197.9971	-205.4121	0.7975	0.8353
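
The reward columns above can be cross-checked against each other. The column names match the metrics logged by TRL's DPOTrainer, where Rewards/margins is the mean of chosen minus rejected rewards; this card does not say which trainer was used, so treat that as an assumption. The sketch below copies the Step, Validation Loss, and reward columns from the table, checks that each logged margin agrees with chosen minus rejected up to rounding, and reports the step with the lowest validation loss. The names `ROWS`, `margin_gaps`, and `best_step` are hypothetical helpers introduced for illustration.

```python
# Values copied from the table above:
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
ROWS = [
    (19, 0.6390, 0.0633, -0.1318, 0.1951),
    (38, 0.6384, 0.0354, -0.2285, 0.2639),
    (57, 0.6391, -0.0114, -0.2258, 0.2145),
    (76, 0.6344, -0.3737, -0.6175, 0.2438),
    (95, 0.6285, -0.1130, -0.3816, 0.2687),
    (114, 0.6342, 0.0007, -0.2311, 0.2318),
    (133, 0.6258, 0.0462, -0.2234, 0.2696),
    (152, 0.6234, 0.0858, -0.1898, 0.2756),
    (171, 0.6269, 0.1007, -0.1655, 0.2662),
]


def margin_gaps(rows):
    """Absolute gap between the logged margin and (chosen - rejected) per row."""
    return [abs(margin - (chosen - rejected))
            for _, _, chosen, rejected, margin in rows]


def best_step(rows):
    """Step of the checkpoint with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    # Assumption: margin = chosen - rejected, as in TRL's DPOTrainer logging.
    # Values are rounded to four decimals, so allow a small tolerance.
    worst = max(margin_gaps(ROWS))
    print(f"largest margin gap: {worst:.4f}")
    print(f"lowest validation loss at step {best_step(ROWS)}")
```

Because the table values are rounded to four decimals, a gap of about 0.0001 between the logged and recomputed margin is expected and does not indicate a logging error.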