---
license: apache-2.0
datasets:
- openbmb/UltraInteract_pair
language:
- en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

This model was released with our paper [Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF](https://arxiv.org/abs/2410.04612).

# REFUEL-Llama-3-Armo-iter_2

This model was developed with REFUEL, starting from [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), with [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) as the reward model and [UltraInteract](https://huggingface.co/datasets/openbmb/UltraInteract_pair) as the training dataset.

The training code is available at https://github.com/ZhaolinGao/REFUEL.
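
## Usage

Since the model is a fine-tune of Meta-Llama-3-8B-Instruct, it can be loaded with 🤗 Transformers and prompted through the standard Llama-3 chat template. Below is a minimal usage sketch; the generation hyperparameters are illustrative, not settings from the paper.

```python
# Minimal sketch: load REFUEL-Llama-3-Armo-iter_2 and generate one reply in a
# multi-turn chat. Sampling settings below are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cornell-AGI/REFUEL-Llama-3-Armo-iter_2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The tokenizer applies the Llama-3 chat template to the message history.
messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```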

## Evaluations

<table>
  <tr>
    <th rowspan="2">Method</th>
    <th rowspan="2">Dataset</th>
    <th colspan="6">Winrate (%) at Turn</th>
  </tr>
  <tr>
    <th>h = 1</th>
    <th>h = 2</th>
    <th>h = 3</th>
    <th>h = 4</th>
    <th>h = 5</th>
    <th>avg</th>
  </tr>
  <tr>
    <td>Llama-3.1-70B-it</td>
    <td>N/A</td>
    <td>70.4</td>
    <td>66.4</td>
    <td>61.0</td>
    <td>53.0</td>
    <td>55.4</td>
    <td>61.24</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_1">REFUEL-Llama-3-Armo-iter_1</a></td>
    <td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_1">REFUEL-Ultrainteract-Llama-3-Armo-iter_1</a></td>
    <td>54.6</td>
    <td>53.6</td>
    <td>57.8</td>
    <td>56.2</td>
    <td>59.4</td>
    <td>56.32</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_2">REFUEL-Llama-3-Armo-iter_2</a></td>
    <td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_2">REFUEL-Ultrainteract-Llama-3-Armo-iter_2</a></td>
    <td>55.2</td>
    <td>53.4</td>
    <td>58.8</td>
    <td>57.2</td>
    <td>58.6</td>
    <td>56.64</td>
  </tr>
</table>

## Citation

Please cite our paper if you use this model in your own work:

```
@misc{gao2024regressingrelativefutureefficient,
      title={Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF},
      author={Zhaolin Gao and Wenhao Zhan and Jonathan D. Chang and Gokul Swamy and Kianté Brantley and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2410.04612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.04612},
}
```