
phi-2-gpo-renew2-b0.001-0.5ultrafeedback-rank256-i1

This model is a fine-tuned version of DUAL-GPO/phi-2-gpo-renew2-b0.001-i0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0492
  • Rewards/chosen: 0.0680
  • Rewards/rejected: 0.0518
  • Rewards/accuracies: 0.5700
  • Rewards/margins: 0.0162
  • Logps/rejected: -1824.6278
  • Logps/chosen: -2148.4272
  • Logits/rejected: -0.2254
  • Logits/chosen: -0.2188
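
The reported margin is simply the chosen reward minus the rejected reward (the usual convention in TRL-style preference trainers, which this GPO setup appears to follow), and the evaluation numbers above are consistent with it:

$$\text{rewards/margins} = \text{rewards/chosen} - \text{rewards/rejected} = 0.0680 - 0.0518 = 0.0162$$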

Model description

More information needed

Intended uses & limitations

More information needed
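
In the absence of fuller documentation, here is a minimal inference sketch. It assumes this repository is a PEFT (LoRA) adapter, as the rank-256 name and the PEFT framework version below suggest; adjust dtype and device settings to your hardware.

```python
# Minimal inference sketch (assumption: this repo is a PEFT adapter on a phi-2 base).
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model_id = "DUAL-GPO/phi-2-gpo-renew2-b0.001-0.5ultrafeedback-rank256-i1"

# AutoPeftModelForCausalLM reads the adapter config, loads the referenced base
# model, and applies the adapter weights on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# If the adapter repo does not ship tokenizer files, load the tokenizer from
# the base model (microsoft/phi-2) instead.
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Explain the difference between supervised fine-tuning and preference optimization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```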

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list for how they map onto TrainingArguments):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
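
As a rough guide, these settings map onto transformers.TrainingArguments as sketched below. This is not the original training script; the GPO/DPO-specific loss configuration (e.g. the beta of 0.001 implied by the model name) lives in the trainer, not in these arguments, and the bf16 flag is an assumption.

```python
# Sketch: the listed hyperparameters expressed as transformers.TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-2-gpo-renew2-b0.001-0.5ultrafeedback-rank256-i1",
    learning_rate=5e-6,
    per_device_train_batch_size=4,   # train_batch_size
    per_device_eval_batch_size=4,    # eval_batch_size
    gradient_accumulation_steps=4,   # 4 x 4 = effective batch size of 16
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,                       # assumption: mixed precision on GPU
)
```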

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.0519 | 0.05 | 100  | 0.0524 | 0.0326 | 0.0267 | 0.5255 | 0.0059 | -1849.7029 | -2183.7695 | -0.2520 | -0.2532 |
| 0.0379 | 0.1  | 200  | 0.0514 | 0.0434 | 0.0342 | 0.5390 | 0.0093 | -1842.2477 | -2172.9629 | -0.2760 | -0.2740 |
| 0.0425 | 0.16 | 300  | 0.0513 | 0.0344 | 0.0246 | 0.5630 | 0.0098 | -1851.8621 | -2182.0454 | -0.2899 | -0.2901 |
| 0.0522 | 0.21 | 400  | 0.0520 | 0.0818 | 0.0659 | 0.5250 | 0.0159 | -1810.5037 | -2134.5779 | -0.2777 | -0.2683 |
| 0.0559 | 0.26 | 500  | 0.0502 | 0.0610 | 0.0475 | 0.5625 | 0.0134 | -1828.8737 | -2155.4170 | -0.2991 | -0.2903 |
| 0.0546 | 0.31 | 600  | 0.0504 | 0.0493 | 0.0369 | 0.5525 | 0.0124 | -1839.5269 | -2167.1086 | -0.3840 | -0.3719 |
| 0.0443 | 0.37 | 700  | 0.0501 | 0.0527 | 0.0403 | 0.5670 | 0.0124 | -1836.1396 | -2163.6941 | -0.3238 | -0.3145 |
| 0.0583 | 0.42 | 800  | 0.0502 | 0.0559 | 0.0434 | 0.5625 | 0.0125 | -1833.0012 | -2160.5334 | -0.3079 | -0.2990 |
| 0.0432 | 0.47 | 900  | 0.0500 | 0.0872 | 0.0702 | 0.5485 | 0.0170 | -1806.2087 | -2129.1819 | -0.2529 | -0.2455 |
| 0.0538 | 0.52 | 1000 | 0.0496 | 0.0598 | 0.0468 | 0.5650 | 0.0129 | -1829.5831 | -2156.6528 | -0.2593 | -0.2565 |
| 0.0545 | 0.58 | 1100 | 0.0495 | 0.0923 | 0.0738 | 0.5560 | 0.0185 | -1802.5935 | -2124.1084 | -0.2394 | -0.2311 |
| 0.0481 | 0.63 | 1200 | 0.0495 | 0.0607 | 0.0467 | 0.5685 | 0.0140 | -1829.7306 | -2155.7429 | -0.2181 | -0.2147 |
| 0.0441 | 0.68 | 1300 | 0.0495 | 0.0567 | 0.0429 | 0.5690 | 0.0139 | -1833.5485 | -2159.6755 | -0.2202 | -0.2175 |
| 0.0524 | 0.73 | 1400 | 0.0496 | 0.0527 | 0.0389 | 0.5685 | 0.0138 | -1837.5038 | -2163.6599 | -0.2475 | -0.2422 |
| 0.0425 | 0.79 | 1500 | 0.0493 | 0.0621 | 0.0466 | 0.5675 | 0.0154 | -1829.7928 | -2154.3403 | -0.2335 | -0.2274 |
| 0.0387 | 0.84 | 1600 | 0.0492 | 0.0712 | 0.0545 | 0.5705 | 0.0167 | -1821.8909 | -2145.1594 | -0.2298 | -0.2230 |
| 0.0556 | 0.89 | 1700 | 0.0492 | 0.0673 | 0.0511 | 0.5675 | 0.0161 | -1825.2786 | -2149.1382 | -0.2259 | -0.2196 |
| 0.0519 | 0.94 | 1800 | 0.0492 | 0.0683 | 0.0521 | 0.5690 | 0.0162 | -1824.3348 | -2148.1096 | -0.2241 | -0.2176 |
| 0.05   | 0.99 | 1900 | 0.0492 | 0.0679 | 0.0518 | 0.5670 | 0.0162 | -1824.6459 | -2148.4578 | -0.2254 | -0.2187 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2
Model tree

  • Base model: microsoft/phi-2
  • This model: a PEFT adapter derived from that base (via DUAL-GPO/phi-2-gpo-renew2-b0.001-i0)