---
license: apache-2.0
datasets:
- openbmb/UltraInteract_pair
language:
- en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

This is a model released for our paper: [Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF](https://arxiv.org/abs/2410.04612).

# REFUEL-Llama-3-Armo-iter_1

This model is developed with REFUEL based on [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), with [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) as the reward model and the [UltraInteract](https://huggingface.co/datasets/openbmb/UltraInteract_pair) dataset.

The training code is available at https://github.com/ZhaolinGao/REFUEL.

## Evaluations

Winrate at each turn h (H = 5 is the final turn):

| Method | Dataset | h = 1 | h = 2 | h = 3 | h = 4 | H = 5 | avg |
|---|---|---|---|---|---|---|---|
| Llama-3.1-70B-it | N/A | 70.4 | 66.4 | 61.0 | 53.0 | 55.4 | 61.24 |
| REFUEL-Llama-3-Armo-iter_1 | REFUEL-Ultrainteract-Llama-3-Armo-iter_1 | 54.6 | 53.6 | 57.8 | 56.2 | 59.4 | 56.32 |
| REFUEL-Llama-3-Armo-iter_2 | REFUEL-Ultrainteract-Llama-3-Armo-iter_2 | 55.2 | 53.4 | 58.8 | 57.2 | 58.6 | 56.64 |
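
As a quick sanity check on the table, the sketch below recomputes the `avg` column and the change in winrate from the first to the last turn. It assumes `avg` is the unweighted mean of the five per-turn winrates, which the reported numbers are consistent with. The names `WINRATES`, `average_winrate`, and `last_minus_first` are illustrative and not part of the REFUEL codebase.

```python
from statistics import mean

# Per-turn winrates copied from the evaluation table (turns h = 1..5).
WINRATES = {
    "Llama-3.1-70B-it": [70.4, 66.4, 61.0, 53.0, 55.4],
    "REFUEL-Llama-3-Armo-iter_1": [54.6, 53.6, 57.8, 56.2, 59.4],
    "REFUEL-Llama-3-Armo-iter_2": [55.2, 53.4, 58.8, 57.2, 58.6],
}


def average_winrate(per_turn):
    """Unweighted mean over turns, rounded to two decimals (assumed definition of `avg`)."""
    return round(mean(per_turn), 2)


def last_minus_first(per_turn):
    """Winrate at the final turn minus winrate at the first turn."""
    return round(per_turn[-1] - per_turn[0], 1)


if __name__ == "__main__":
    for name, per_turn in WINRATES.items():
        print(
            f"{name}: avg={average_winrate(per_turn):.2f}, "
            f"last-first={last_minus_first(per_turn):+.1f}"
        )
```

Under this reading, the two REFUEL models have a higher winrate at the final turn than at the first, while Llama-3.1-70B-it has a lower one.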

## Citation

Please cite our paper if you use this model in your own work:

```
@misc{gao2024regressingrelativefutureefficient,
      title={Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF},
      author={Zhaolin Gao and Wenhao Zhan and Jonathan D. Chang and Gokul Swamy and Kianté Brantley and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2410.04612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.04612},
}
```