Update README.md
README.md
CHANGED
datasets:
- berkeley-nest/Nectar
language:
- en
base_model: openchat/openchat_3.5
---

This is a model released for our paper: [REBEL: Reinforcement Learning via Regressing Relative Rewards](https://arxiv.org/abs/2404.16767).

# REBEL-OpenChat-3.5

This model was developed with REBEL, starting from [OpenChat-3.5](https://huggingface.co/openchat/openchat_3.5) and using [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) as the reward model on the [Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) dataset.
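
For inference, a minimal sketch with the `transformers` library is shown below; the repository id `Cornell-AGI/REBEL-OpenChat-3.5` and the availability of a built-in chat template are assumptions rather than details stated on this card.

```python
# Hedged sketch: repository id and chat template availability are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cornell-AGI/REBEL-OpenChat-3.5"  # assumed repo id, adjust if different
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the REBEL training objective in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding; strip the prompt tokens before printing the reply.
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```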

The training code is available at https://github.com/ZhaolinGao/REBEL. We collect online generations during each iteration with a batch size of 32.
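
The full training loop lives in the repository above. As a rough sketch rather than the repository's exact code, REBEL's core update regresses the difference of policy log-probability ratios onto the difference of reward-model scores for a pair of responses to the same prompt; the tensor names and `eta` below are illustrative.

```python
import torch

def rebel_pair_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                    reward_a, reward_b, eta=1.0):
    """Least-squares REBEL objective for a batch of response pairs (a, b).

    logp_*  : summed log-probabilities of each full response under the current
              policy ("new") and the previous iteration's policy ("old")
    reward_*: scalar reward-model scores (e.g. from Starling-RM-7B-alpha)
    eta     : step-size hyperparameter from the paper
    """
    # Predicted relative reward, expressed through policy log-ratios.
    pred = ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)) / eta
    # Target relative reward from the reward model.
    target = reward_a - reward_b
    # Regress the prediction onto the target with a squared error.
    return ((pred - target) ** 2).mean()
```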

### Links to Other Models

[REBEL-Llama-3-epoch_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-epoch_2)

[REBEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_1)

[REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2)

[REBEL-Llama-3-Armo-iter_3](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_3)

### Evaluations

| Model | AlpacaEval 2.0<br>LC Win Rate | AlpacaEval 2.0<br>Win Rate | MT-Bench<br>Average | MMLU<br>(5-shot) | GSM8K<br>(5-shot) |
| :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
| REBEL-OpenChat-3.5 | 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
| REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
| REBEL-Llama-3-epoch_2 | 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
| REBEL-Llama-3-Armo-iter_1 | 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
| REBEL-Llama-3-Armo-iter_2 | 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
| REBEL-Llama-3-Armo-iter_3 | 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |
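
The MMLU and GSM8K columns are 5-shot. A sketch of how such numbers could be reproduced with EleutherAI's lm-evaluation-harness is below; the harness, settings, and repository id are assumptions, since the card does not state the exact evaluation pipeline.

```python
# Hedged sketch: assumes lm-evaluation-harness (pip install lm-eval) and that the
# checkpoint is hosted as Cornell-AGI/REBEL-OpenChat-3.5; the card does not
# specify the harness or settings actually used for these numbers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Cornell-AGI/REBEL-OpenChat-3.5,dtype=bfloat16",
    tasks=["mmlu", "gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])  # per-task accuracy, e.g. mmlu and gsm8k scores
```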

## Citation
Please cite our paper if you use this model in your own work:

```
      ...
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```