---
license: mit
language:
- en
---

# LM Loss OPT RM

This is a fine-tuned OPT-13B reward model. It was fine-tuned on the full [SLF5K](https://huggingface.co/datasets/JeremyAlain/SLF5K) dataset, following the method presented in the paper [Training Language Models with Language Feedback at Scale](https://arxiv.org/abs/2303.16755). The main results are summarized in the following table:

| Model       | # Params | Validation Accuracy (%) |
|-------------|----------|-------------------------|
| OPT LM Loss | 13B      | **73.4 ± 1.9**          |
| OPT LM Loss | 1.3B     | 69.6 ± 2.0              |
| OPT RM Loss | 13B      | 71.8 ± 2.0              |

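Since the LM-loss reward model is an OPT checkpoint trained with a language-modelling objective, it can be loaded as a standard causal LM with `transformers`. The snippet below is a minimal sketch of ranking two candidate summaries by comparing next-token probabilities; the repository id, prompt template, and answer tokens (`" A"` / `" B"`) are assumptions rather than the exact format from the paper, and should be adapted to match the training setup.

```python
# Minimal usage sketch (not the official inference code from the paper).
# It loads the checkpoint as a plain OPT causal LM and picks the preferred
# summary by comparing the next-token probability of two answer tokens.
# The repo id, prompt template, and choice tokens are placeholders; adapt
# them to the exact format used during fine-tuning on SLF5K.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "<this-checkpoint>"  # placeholder: local path or Hub repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


def preferred_summary(post: str, summary_a: str, summary_b: str) -> str:
    # Hypothetical prompt template; the real one should match the training data.
    prompt = (
        f"Post: {post}\n\n"
        f"Summary A: {summary_a}\n\n"
        f"Summary B: {summary_b}\n\n"
        "Which summary is better? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Compare the logits assigned to the two candidate answer tokens.
    id_a = tokenizer(" A", add_special_tokens=False).input_ids[0]
    id_b = tokenizer(" B", add_special_tokens=False).input_ids[0]
    return "A" if next_token_logits[id_a] > next_token_logits[id_b] else "B"


print(preferred_summary(
    "Example Reddit post to be summarized.",
    "First candidate summary.",
    "Second candidate summary.",
))
```
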
If you use this model, please cite the following paper:

```
@article{scheurer2023training,
  title={Training Language Models with Language Feedback at Scale},
  author={Scheurer, J{\'e}r{\'e}my and Campos, Jon Ander and Korbak, Tomasz and Chan, Jun Shern and Chen, Angelica and Cho, Kyunghyun and Perez, Ethan},
  journal={arXiv preprint arXiv:2303.16755},
  year={2023}
}
```