Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)
This model was developed with Self-Play Preference Optimization at iteration 2, using google/gemma-2-9b-it as the starting point. We used the prompt sets from the openbmb/UltraFeedback dataset, split into three parts for the three iterations following snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used are synthetic.
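At each iteration, SPPO samples several responses per prompt from the current policy, scores all pairs with a small preference model (PairRM in the paper), and then fits the next policy so that the log-density ratio against the current policy matches the centered, scaled win probability of each response. The snippet below is a minimal sketch of that squared-error objective in PyTorch; the function names, tensor shapes, and the default `eta` value are illustrative assumptions, not the released training code.

```python
import torch


def estimate_win_prob(pref_matrix: torch.Tensor) -> torch.Tensor:
    """pref_matrix[i, j] = P(y_i beats y_j | x) from a pairwise preference model
    (e.g. PairRM). The win probability of y_i against the current policy pi_t is
    approximated by averaging over the K sampled opponents."""
    return pref_matrix.mean(dim=-1)


def sppo_loss(logp_theta: torch.Tensor,
              logp_prev: torch.Tensor,
              p_win: torch.Tensor,
              eta: float = 1e3) -> torch.Tensor:
    """Squared-error form of the SPPO update: push the log-density ratio
    log(pi_theta(y|x) / pi_t(y|x)) toward eta * (P_hat(y > pi_t | x) - 1/2).

    logp_theta: summed response log-probs under the policy being trained, shape [B]
    logp_prev:  summed response log-probs under the previous iteration's policy, shape [B]
    p_win:      estimated win probabilities P_hat(y > pi_t | x), shape [B]
    eta:        scaling hyperparameter (the default here is an assumption)
    """
    log_ratio = logp_theta - logp_prev
    target = eta * (p_win - 0.5)
    return ((log_ratio - target) ** 2).mean()
```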
Terms of Use: Terms
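Below is a hedged sketch of how a checkpoint from this collection can be loaded and queried with Hugging Face Transformers; the repository id is a placeholder and the generation settings are assumptions, not part of the official release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/sppo-iter2-checkpoint"  # placeholder: substitute the actual repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize self-play preference optimization in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```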
AlpacaEval 2.0 evaluation results:

| Model | LC Win Rate (%) | Win Rate (%) | Avg. Length |
|---|---|---|---|
| Gemma-2-9B-SPPO Iter1 | 48.70 | 40.76 | 1669 |
| Gemma-2-9B-SPPO Iter2 | 50.93 | 44.64 | 1759 |
| Gemma-2-9B-SPPO Iter3 | 53.27 | 47.74 | 1803 |
The following hyperparameters were used during training:
@misc{wu2024self,
title={Self-Play Preference Optimization for Language Model Alignment},
author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
year={2024},
eprint={2405.00675},
archivePrefix={arXiv},
primaryClass={cs.LG}
}