---
license: apache-2.0
datasets:
- openbmb/UltraInteract_pair
language:
- en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---
This model was released with our paper [Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF](https://arxiv.org/abs/2410.04612).
# REFUEL-Llama-3-Armo-iter_2
This model was trained with REFUEL starting from [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), using [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) as the reward model and [UltraInteract](https://huggingface.co/datasets/openbmb/UltraInteract_pair) as the training dataset.
The training code is available at https://github.com/ZhaolinGao/REFUEL.
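## Usage
Below is a minimal inference sketch using Hugging Face Transformers, showing one way to load the model and generate a single assistant turn. It is not from the REFUEL codebase, and the generation settings are illustrative only; since the model is based on Meta-Llama-3-8B-Instruct, it uses the standard Llama-3 chat template.
```python
# Minimal inference sketch (not from the REFUEL repo): load the model and
# generate one assistant turn with Hugging Face Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cornell-AGI/REFUEL-Llama-3-Armo-iter_2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The model inherits the Llama-3-Instruct chat template, so a multi-turn
# history is passed as a list of role/content messages.
messages = [
    {"role": "user", "content": "Solve 12 * 7 and explain each step."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,          # illustrative; tune for your use case
    do_sample=False,             # greedy decoding for reproducibility
    pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens and decode only the newly generated turn.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```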
## Evaluations
<table>
<tr>
<th rowspan="2">Method</th>
<th rowspan="2">Dataset</th>
<th colspan="6">Win rate (%) at turn</th>
</tr>
<tr>
<th>h = 1</th>
<th>h = 2</th>
<th>h = 3</th>
<th>h = 4</th>
<th>h = 5</th>
<th>avg</th>
</tr>
<tr>
<td>Llama-3.1-70B-it</td>
<td> N/A </td>
<td>70.4</td>
<td>66.4</td>
<td>61.0</td>
<td>53.0</td>
<td>55.4</td>
<td>61.24</td>
</tr>
<tr>
<td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_1">REFUEL-Llama-3-Armo-iter_1</a></td>
<td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_1">REFUEL-Ultrainteract-Llama-3-Armo-iter_1</a></td>
<td>54.6</td>
<td>53.6</td>
<td>57.8</td>
<td>56.2</td>
<td>59.4</td>
<td>56.32</td>
</tr>
<tr>
<td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_2">REFUEL-Llama-3-Armo-iter_2</a></td>
<td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_2">REFUEL-Ultrainteract-Llama-3-Armo-iter_2</a></td>
<td>55.2</td>
<td>53.4</td>
<td>58.8</td>
<td>57.2</td>
<td>58.6</td>
<td>56.64</td>
</tr>
</table>
## Citation
Please cite our paper if you use this model in your own work:
```bibtex
@misc{gao2024regressingrelativefutureefficient,
      title={Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF},
      author={Zhaolin Gao and Wenhao Zhan and Jonathan D. Chang and Gokul Swamy and Kianté Brantley and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2410.04612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.04612},
}
```