---
license: apache-2.0
datasets:
- openbmb/UltraInteract_pair
language:
- en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

This model was released with our paper [Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF](https://arxiv.org/abs/2410.04612).

# REFUEL-Llama-3-Armo-iter_2

This model was developed with REFUEL, starting from [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), with [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) as the reward model and [UltraInteract](https://huggingface.co/datasets/openbmb/UltraInteract_pair) as the training dataset.

The training code is available at https://github.com/ZhaolinGao/REFUEL.
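
## Usage

Since the model is a fine-tune of Meta-Llama-3-8B-Instruct, it can be loaded with 🤗 Transformers and prompted through the standard Llama-3 chat template. Below is a minimal usage sketch; the generation hyperparameters are illustrative, not settings from the paper.

```python
# Minimal sketch: load REFUEL-Llama-3-Armo-iter_2 and generate one reply in a
# multi-turn chat. Sampling settings below are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cornell-AGI/REFUEL-Llama-3-Armo-iter_2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The tokenizer applies the Llama-3 chat template to the message history.
messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```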

## Evaluations

<table>
  <tr>
    <th rowspan="2">Method</th>
    <th rowspan="2">Dataset</th>
    <th colspan="6">Winrate (%) at Turn</th>
  </tr>
  <tr>
    <th>h = 1</th>
    <th>h = 2</th>
    <th>h = 3</th>
    <th>h = 4</th>
    <th>h = 5</th>
    <th>avg</th>
  </tr>
  <tr>
    <td>Llama-3.1-70B-it</td>
    <td>N/A</td>
    <td>70.4</td>
    <td>66.4</td>
    <td>61.0</td>
    <td>53.0</td>
    <td>55.4</td>
    <td>61.24</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_1">REFUEL-Llama-3-Armo-iter_1</a></td>
    <td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_1">REFUEL-Ultrainteract-Llama-3-Armo-iter_1</a></td>
    <td>54.6</td>
    <td>53.6</td>
    <td>57.8</td>
    <td>56.2</td>
    <td>59.4</td>
    <td>56.32</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_2">REFUEL-Llama-3-Armo-iter_2</a></td>
    <td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_2">REFUEL-Ultrainteract-Llama-3-Armo-iter_2</a></td>
    <td>55.2</td>
    <td>53.4</td>
    <td>58.8</td>
    <td>57.2</td>
    <td>58.6</td>
    <td>56.64</td>
  </tr>
</table>

## Citation

Please cite our paper if you use this model in your own work:

```
@misc{gao2024regressingrelativefutureefficient,
      title={Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF},
      author={Zhaolin Gao and Wenhao Zhan and Jonathan D. Chang and Gokul Swamy and Kianté Brantley and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2410.04612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.04612},
}
```