
phi-2-gpo-renew2-b0.001-0.5ultrafeedback-lowLr-i1

This model is a fine-tuned version of DUAL-GPO/phi-2-gpo-renew2-b0.001-i0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0497
  • Rewards/chosen: 0.0617
  • Rewards/rejected: 0.0473
  • Rewards/accuracies: 0.5645
  • Rewards/margins: 0.0144
  • Logps/rejected: -1829.1201
  • Logps/chosen: -2154.7461
  • Logits/rejected: -0.2678
  • Logits/chosen: -0.2583
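
Since this checkpoint is a PEFT adapter rather than a full model (see the framework versions and model tree below), it must be loaded on top of its base model. A minimal loading sketch, assuming the microsoft/phi-2 base listed in the model tree; the dtype, prompt, and generation settings are illustrative, not from this card:

```python
# Minimal sketch: load the adapter on top of its base model with PEFT.
# Assumes the microsoft/phi-2 base from the model tree; dtype, prompt,
# and generation settings below are illustrative assumptions.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "DUAL-GPO/phi-2-gpo-renew2-b0.001-0.5ultrafeedback-lowLr-i1"

# AutoPeftModelForCausalLM reads the adapter config, fetches the base model,
# and attaches the adapter weights in one call.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

inputs = tokenizer("What is preference optimization?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```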

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
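
For reference, this list maps one-to-one onto `transformers.TrainingArguments`. A minimal sketch, assuming a standard Trainer-style setup; `output_dir` is an illustrative assumption, and the Adam betas/epsilon listed above are the optimizer defaults:

```python
# A minimal sketch of the hyperparameters above as transformers.TrainingArguments.
# output_dir is an illustrative assumption; every other value mirrors the card.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-2-gpo-renew2-b0.001-0.5ultrafeedback-lowLr-i1",  # assumed
    learning_rate=2e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # 4 per device x 4 steps = total batch 16
    seed=42,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    adam_beta1=0.9,    # Adam defaults, as listed above
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```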

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0.0515 | 0.05 | 100 | 0.0532 | 0.0078 | 0.0065 | 0.5190 | 0.0013 | -1869.9457 | -2208.6421 | -0.2109 | -0.2202 |
| 0.0386 | 0.1 | 200 | 0.0515 | 0.0511 | 0.0427 | 0.5095 | 0.0083 | -1833.6853 | -2165.3538 | -0.2153 | -0.2175 |
| 0.0428 | 0.16 | 300 | 0.0515 | 0.0358 | 0.0281 | 0.5465 | 0.0077 | -1848.3311 | -2180.6155 | -0.2312 | -0.2333 |
| 0.0513 | 0.21 | 400 | 0.0520 | 0.0645 | 0.0516 | 0.5305 | 0.0129 | -1824.8289 | -2151.9404 | -0.2533 | -0.2474 |
| 0.0565 | 0.26 | 500 | 0.0507 | 0.0520 | 0.0403 | 0.5565 | 0.0117 | -1836.1078 | -2164.4390 | -0.2774 | -0.2711 |
| 0.0549 | 0.31 | 600 | 0.0504 | 0.0581 | 0.0443 | 0.5535 | 0.0138 | -1832.1049 | -2158.2695 | -0.3657 | -0.3506 |
| 0.0445 | 0.37 | 700 | 0.0504 | 0.0480 | 0.0362 | 0.5575 | 0.0118 | -1840.2194 | -2168.3940 | -0.3268 | -0.3160 |
| 0.0584 | 0.42 | 800 | 0.0504 | 0.0547 | 0.0417 | 0.5530 | 0.0130 | -1834.7174 | -2161.7117 | -0.3244 | -0.3128 |
| 0.0439 | 0.47 | 900 | 0.0501 | 0.0743 | 0.0588 | 0.5455 | 0.0155 | -1817.6077 | -2142.0779 | -0.3005 | -0.2897 |
| 0.0545 | 0.52 | 1000 | 0.0500 | 0.0612 | 0.0477 | 0.5580 | 0.0135 | -1828.6910 | -2155.1626 | -0.2889 | -0.2812 |
| 0.0535 | 0.58 | 1100 | 0.0499 | 0.0762 | 0.0605 | 0.5480 | 0.0158 | -1815.9238 | -2140.1655 | -0.2758 | -0.2662 |
| 0.0484 | 0.63 | 1200 | 0.0499 | 0.0611 | 0.0476 | 0.5545 | 0.0135 | -1828.7972 | -2155.2605 | -0.2614 | -0.2536 |
| 0.0443 | 0.68 | 1300 | 0.0499 | 0.0536 | 0.0409 | 0.5640 | 0.0127 | -1835.5480 | -2162.8499 | -0.2628 | -0.2563 |
| 0.0527 | 0.73 | 1400 | 0.0500 | 0.0536 | 0.0406 | 0.5705 | 0.0130 | -1835.7953 | -2162.7734 | -0.2801 | -0.2716 |
| 0.0427 | 0.79 | 1500 | 0.0499 | 0.0581 | 0.0443 | 0.5655 | 0.0137 | -1832.0787 | -2158.3472 | -0.2702 | -0.2613 |
| 0.0391 | 0.84 | 1600 | 0.0498 | 0.0624 | 0.0479 | 0.5625 | 0.0145 | -1828.5033 | -2153.9939 | -0.2688 | -0.2594 |
| 0.056 | 0.89 | 1700 | 0.0498 | 0.0626 | 0.0481 | 0.5615 | 0.0145 | -1828.3557 | -2153.8423 | -0.2686 | -0.2589 |
| 0.0505 | 0.94 | 1800 | 0.0498 | 0.0619 | 0.0476 | 0.5655 | 0.0144 | -1828.8563 | -2154.4631 | -0.2667 | -0.2571 |
| 0.0501 | 0.99 | 1900 | 0.0498 | 0.0617 | 0.0473 | 0.5635 | 0.0144 | -1829.1072 | -2154.7471 | -0.2678 | -0.2582 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2

Model tree for DUAL-GPO/phi-2-gpo-renew2-b0.001-0.5ultrafeedback-lowLr-i1

  • Base model: microsoft/phi-2 (this model is an adapter on top of it)
  • Training dataset: HuggingFaceH4/ultrafeedback_binarized