# phi-2-gpo-renew2-b0.001-0.5ultrafeedback-lowLr-i1
This model is a fine-tuned version of DUAL-GPO/phi-2-gpo-renew2-b0.001-i0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:
- Loss: 0.0497
- Rewards/chosen: 0.0617
- Rewards/rejected: 0.0473
- Rewards/accuracies: 0.5645
- Rewards/margins: 0.0144
- Logps/rejected: -1829.1201
- Logps/chosen: -2154.7461
- Logits/rejected: -0.2678
- Logits/chosen: -0.2583
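As a quick sanity check on the metrics above (not part of the original card), the reported reward margin is simply the chosen reward minus the rejected reward, up to the card's 4-decimal rounding:

```python
# Values copied from the evaluation results above.
rewards_chosen = 0.0617
rewards_rejected = 0.0473
reported_margin = 0.0144

# Rewards/margins = Rewards/chosen - Rewards/rejected.
margin = rewards_chosen - rewards_rejected
# Allow a small tolerance for the 4-decimal rounding used in the card.
assert abs(margin - reported_margin) < 5e-4
print(f"margin = {margin:.4f}")
```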
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-06
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
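The hyperparameters above combine in the usual way; a minimal sketch follows. The process count is an assumption (the reported total_train_batch_size of 16 is consistent with a single process), and the step count of roughly 1900 is read off the results table below:

```python
import math

# Effective batch size: per-device batch x gradient accumulation x processes.
per_device_batch = 4
grad_accum_steps = 4
num_processes = 1  # assumption: implied by total_train_batch_size = 16
total_train_batch_size = per_device_batch * grad_accum_steps * num_processes
assert total_train_batch_size == 16

def lr_at(step, total_steps, base_lr=2e-6, warmup_ratio=0.1):
    """Cosine schedule with linear warmup, mirroring
    lr_scheduler_type=cosine and lr_scheduler_warmup_ratio=0.1."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Roughly 1900 optimizer steps are visible in the results table.
for step in (0, 190, 950, 1900):
    print(step, lr_at(step, 1900))
```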
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.0515 | 0.05 | 100 | 0.0532 | 0.0078 | 0.0065 | 0.5190 | 0.0013 | -1869.9457 | -2208.6421 | -0.2109 | -0.2202 |
| 0.0386 | 0.1 | 200 | 0.0515 | 0.0511 | 0.0427 | 0.5095 | 0.0083 | -1833.6853 | -2165.3538 | -0.2153 | -0.2175 |
| 0.0428 | 0.16 | 300 | 0.0515 | 0.0358 | 0.0281 | 0.5465 | 0.0077 | -1848.3311 | -2180.6155 | -0.2312 | -0.2333 |
| 0.0513 | 0.21 | 400 | 0.0520 | 0.0645 | 0.0516 | 0.5305 | 0.0129 | -1824.8289 | -2151.9404 | -0.2533 | -0.2474 |
| 0.0565 | 0.26 | 500 | 0.0507 | 0.0520 | 0.0403 | 0.5565 | 0.0117 | -1836.1078 | -2164.4390 | -0.2774 | -0.2711 |
| 0.0549 | 0.31 | 600 | 0.0504 | 0.0581 | 0.0443 | 0.5535 | 0.0138 | -1832.1049 | -2158.2695 | -0.3657 | -0.3506 |
| 0.0445 | 0.37 | 700 | 0.0504 | 0.0480 | 0.0362 | 0.5575 | 0.0118 | -1840.2194 | -2168.3940 | -0.3268 | -0.3160 |
| 0.0584 | 0.42 | 800 | 0.0504 | 0.0547 | 0.0417 | 0.5530 | 0.0130 | -1834.7174 | -2161.7117 | -0.3244 | -0.3128 |
| 0.0439 | 0.47 | 900 | 0.0501 | 0.0743 | 0.0588 | 0.5455 | 0.0155 | -1817.6077 | -2142.0779 | -0.3005 | -0.2897 |
| 0.0545 | 0.52 | 1000 | 0.0500 | 0.0612 | 0.0477 | 0.5580 | 0.0135 | -1828.6910 | -2155.1626 | -0.2889 | -0.2812 |
| 0.0535 | 0.58 | 1100 | 0.0499 | 0.0762 | 0.0605 | 0.5480 | 0.0158 | -1815.9238 | -2140.1655 | -0.2758 | -0.2662 |
| 0.0484 | 0.63 | 1200 | 0.0499 | 0.0611 | 0.0476 | 0.5545 | 0.0135 | -1828.7972 | -2155.2605 | -0.2614 | -0.2536 |
| 0.0443 | 0.68 | 1300 | 0.0499 | 0.0536 | 0.0409 | 0.5640 | 0.0127 | -1835.5480 | -2162.8499 | -0.2628 | -0.2563 |
| 0.0527 | 0.73 | 1400 | 0.0500 | 0.0536 | 0.0406 | 0.5705 | 0.0130 | -1835.7953 | -2162.7734 | -0.2801 | -0.2716 |
| 0.0427 | 0.79 | 1500 | 0.0499 | 0.0581 | 0.0443 | 0.5655 | 0.0137 | -1832.0787 | -2158.3472 | -0.2702 | -0.2613 |
| 0.0391 | 0.84 | 1600 | 0.0498 | 0.0624 | 0.0479 | 0.5625 | 0.0145 | -1828.5033 | -2153.9939 | -0.2688 | -0.2594 |
| 0.056 | 0.89 | 1700 | 0.0498 | 0.0626 | 0.0481 | 0.5615 | 0.0145 | -1828.3557 | -2153.8423 | -0.2686 | -0.2589 |
| 0.0505 | 0.94 | 1800 | 0.0498 | 0.0619 | 0.0476 | 0.5655 | 0.0144 | -1828.8563 | -2154.4631 | -0.2667 | -0.2571 |
| 0.0501 | 0.99 | 1900 | 0.0498 | 0.0617 | 0.0473 | 0.5635 | 0.0144 | -1829.1072 | -2154.7471 | -0.2678 | -0.2582 |
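The Rewards/* columns are dimensionless scores. In DPO-style preference-optimization pipelines they are typically computed as β times the difference between the policy and reference-model log-probabilities of a completion; the `b0.001` in the model name suggests β = 0.001 here. A minimal sketch of that convention, with made-up log-probability values chosen purely for illustration:

```python
# Hedged sketch of the DPO-style reward convention:
#   reward = beta * (log p_policy(y|x) - log p_ref(y|x))
# beta = 0.001 is suggested by the "b0.001" in the model name.
beta = 0.001

# Hypothetical log-probabilities, not taken from the training run.
logp_policy_chosen, logp_ref_chosen = -2154.75, -2216.45
logp_policy_rejected, logp_ref_rejected = -1829.12, -1876.42

reward_chosen = beta * (logp_policy_chosen - logp_ref_chosen)
reward_rejected = beta * (logp_policy_rejected - logp_ref_rejected)
margin = reward_chosen - reward_rejected

# The margin is always the difference of the two rewards.
assert abs(margin - (reward_chosen - reward_rejected)) < 1e-12
print(round(reward_chosen, 4), round(reward_rejected, 4), round(margin, 4))
```

With β this small, large absolute log-probability gaps translate into the small reward values seen in the table.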
### Framework versions
- PEFT 0.7.1
- Transformers 4.36.2
- Pytorch 2.1.2
- Datasets 2.14.6
- Tokenizers 0.15.2
## Base model

microsoft/phi-2