metadata
license: apache-2.0
datasets:
- openbmb/UltraInteract_pair
language:
- en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
This model was released with our paper, "Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF" (https://arxiv.org/abs/2410.04612).
REFUEL-Llama-3-Armo-iter_1
This model was trained with REFUEL, starting from Meta-Llama-3-8B-Instruct, with ArmoRM-Llama3-8B-v0.1 as the reward model and the UltraInteract dataset as training data. The training code is available at https://github.com/ZhaolinGao/REFUEL.
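Because the base model is a Llama 3 instruct model, multi-turn inference can probably follow the usual chat-template flow of the Hugging Face `transformers` library. The sketch below is only an illustration under that assumption: `MODEL_ID` is a placeholder (this card does not state the Hub path), and `build_messages` and `generate_reply` are hypothetical helpers, not part of the REFUEL codebase.

```python
from typing import Dict, List

# Placeholder (assumption): this card does not state the Hub path.
# Substitute the real repository id before running generate_reply.
MODEL_ID = "<org>/REFUEL-Llama-3-Armo-iter_1"


def build_messages(turns: List[str]) -> List[Dict[str, str]]:
    """Convert an alternating transcript into chat messages.

    turns[0] is a user message, turns[1] the assistant reply, and so on.
    """
    roles = ("user", "assistant")
    return [{"role": roles[i % 2], "content": text} for i, text in enumerate(turns)]


def generate_reply(turns: List[str], max_new_tokens: int = 256) -> str:
    """Generate the next assistant turn (downloads the weights on first use)."""
    # Imported lazily so build_messages works without these packages installed.
    # device_map="auto" additionally requires the `accelerate` package.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_messages(turns),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and decode only the newly generated reply.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True)
```

Calling `generate_reply(["Solve 2x + 3 = 11.", "x = 4.", "Now verify the answer."])` would produce the third-turn assistant response, given a valid `MODEL_ID` and enough GPU memory for an 8B model.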
Evaluations
Winrate at each turn (h = 1 through H = 5) and the average over turns:

Method | Dataset | h = 1 | h = 2 | h = 3 | h = 4 | H = 5 | avg |
---|---|---|---|---|---|---|---|
Llama-3.1-70B-it | N/A | 70.4 | 66.4 | 61.0 | 53.0 | 55.4 | 61.24 |
REFUEL-Llama-3-Armo-iter_1 | REFUEL-Ultrainteract-Llama-3-Armo-iter_1 | 54.6 | 53.6 | 57.8 | 56.2 | 59.4 | 56.32 |
REFUEL-Llama-3-Armo-iter_2 | REFUEL-Ultrainteract-Llama-3-Armo-iter_2 | 55.2 | 53.4 | 58.8 | 57.2 | 58.6 | 56.64 |
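The `avg` column appears to be the plain arithmetic mean of the five per-turn winrates. The short check below reproduces it from the numbers in the table above; `average_winrate` is a hypothetical helper written for this illustration, not part of the evaluation code.

```python
# Per-turn winrates copied from the table above (h = 1 through H = 5).
winrates = {
    "Llama-3.1-70B-it": [70.4, 66.4, 61.0, 53.0, 55.4],
    "REFUEL-Llama-3-Armo-iter_1": [54.6, 53.6, 57.8, 56.2, 59.4],
    "REFUEL-Llama-3-Armo-iter_2": [55.2, 53.4, 58.8, 57.2, 58.6],
}


def average_winrate(per_turn):
    """Arithmetic mean over turns, rounded to two decimals like the table."""
    return round(sum(per_turn) / len(per_turn), 2)


averages = {name: average_winrate(values) for name, values in winrates.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")
```

The computed means match the reported values of 61.24, 56.32, and 56.64.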
Citation
Please cite our paper if you use this model in your own work:
@misc{gao2024regressingrelativefutureefficient,
title={Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF},
author={Zhaolin Gao and Wenhao Zhan and Jonathan D. Chang and Gokul Swamy and Kianté Brantley and Jason D. Lee and Wen Sun},
year={2024},
eprint={2410.04612},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.04612},
}