dpo-model-lora

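This model is a fine-tuned version of Qwen/Qwen2-0.5B-Instruct on an unknown dataset. At the last logged evaluation (step 450, about 0.93 epochs), it reaches a validation loss of 0.6556, a reward accuracy of 0.6484, and a reward margin of 0.0896 on the evaluation set.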

Model description

More information needed

Training results

Training and evaluation metrics were logged every 50 steps through step 450 (about 0.93 epochs):

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6884	0.1030	50	0.6879	-0.0543	-0.0734	0.6484	0.0191	-351.5229	-371.7161	-2.2877	-2.3628
0.6787	0.2060	100	0.6770	-0.1811	-0.2114	0.6016	0.0303	-352.9030	-372.9836	-2.2815	-2.3565
0.6721	0.3090	150	0.6721	-0.2679	-0.3094	0.6562	0.0415	-353.8831	-373.8524	-2.2782	-2.3536
0.6668	0.4119	200	0.6665	-0.4037	-0.4625	0.6016	0.0588	-355.4139	-375.2100	-2.2758	-2.3515
0.6597	0.5149	250	0.6612	-0.4907	-0.5505	0.6172	0.0598	-356.2946	-376.0805	-2.2757	-2.3510
0.6581	0.6179	300	0.6578	-0.6137	-0.6975	0.625	0.0838	-357.7639	-377.3098	-2.2736	-2.3491
0.6536	0.7209	350	0.6556	-0.6458	-0.7367	0.6328	0.0909	-358.1565	-377.6311	-2.2732	-2.3489
0.6486	0.8239	400	0.6556	-0.7025	-0.7958	0.6328	0.0933	-358.7473	-378.1981	-2.2737	-2.3493
0.649	0.9269	450	0.6556	-0.7432	-0.8327	0.6484	0.0896	-359.1166	-378.6048	-2.2726	-2.3482
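
The reward columns in the table can be cross-checked against each other. The sketch below is an illustration, not part of the training code: it assumes the metrics follow the usual DPO convention, where the implicit reward is beta times the log-probability gap between the policy and a fixed reference model, and Rewards/margins is Rewards/chosen minus Rewards/rejected. Under that assumption, the change in Rewards/chosen divided by the change in Logps/chosen between two evaluations gives an estimate of beta. The card does not state beta, so the value is inferred from the logged numbers only. The names `ROWS`, `margin_residuals`, and `estimate_beta` are hypothetical helpers.

```python
# Values copied from the results table:
# (step, rewards/chosen, rewards/rejected, rewards/margins, logps/chosen)
ROWS = [
    (50, -0.0543, -0.0734, 0.0191, -371.7161),
    (100, -0.1811, -0.2114, 0.0303, -372.9836),
    (150, -0.2679, -0.3094, 0.0415, -373.8524),
    (200, -0.4037, -0.4625, 0.0588, -375.2100),
    (250, -0.4907, -0.5505, 0.0598, -376.0805),
    (300, -0.6137, -0.6975, 0.0838, -377.3098),
    (350, -0.6458, -0.7367, 0.0909, -377.6311),
    (400, -0.7025, -0.7958, 0.0933, -378.1981),
    (450, -0.7432, -0.8327, 0.0896, -378.6048),
]


def margin_residuals(rows):
    """Return |(chosen - rejected) - margin| for every logged step."""
    return [abs((chosen - rejected) - margin)
            for _, chosen, rejected, margin, _ in rows]


def estimate_beta(rows):
    """Estimate beta from the first and last rows.

    Assumes reward = beta * (policy log-prob - reference log-prob) with a
    fixed reference, so the reference term cancels in the difference.
    """
    first, last = rows[0], rows[-1]
    reward_change = last[1] - first[1]
    logp_change = last[4] - first[4]
    return reward_change / logp_change


if __name__ == "__main__":
    # Residuals stay within rounding error of the four-decimal table values.
    print("largest margin residual:", max(margin_residuals(ROWS)))
    print("estimated beta:", round(estimate_beta(ROWS), 4))
```

On these rows the margins match chosen minus rejected to within rounding, and the estimated beta comes out at about 0.1. Both rewards drift negative while the margin grows, which means the policy lowers the log-probability of rejected responses faster than that of chosen ones.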