---
{}
---
# Reward Model Overview
<!-- Provide a quick summary of what the model is/does. -->
The reward model is trained from the base model [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
The training script is available at https://github.com/WeiXiongUST/RLHF-Reward-Modeling.
For training details (data mixture, hyperparameters, etc.), see this short blog post: https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0
## Model Details
If you have any questions about this reward model, or about reward modeling in general, feel free to email me at [email protected]. I would be happy to chat!
### Dataset preprocessing
<!-- Provide a longer summary of what this model is. -->
The model is trained on a mixture of the following datasets. We also provide the mixture in [weqweasdas/preference_dataset_mixture2_and_safe_pku](https://huggingface.co/datasets/weqweasdas/preference_dataset_mixture2_and_safe_pku).
- [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [SHP](https://huggingface.co/datasets/stanfordnlp/SHP)
- [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
- [Capybara](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized)
- [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)
- [Orca](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs)
Difference between this mixture and that of
- SHP: we only use samples with a score ratio > 2 and take at most 5 comparisons per prompt, leading to 109526 comparisons;
- UltraFeedback: similar to UltraFeedback-Binarized, we use the fine-grained scores instead of the overall score to rank samples. For each prompt, we take all 6 possible comparison pairs and then delete the pairs with equal scores, leading to 267416 comparisons;
- HelpSteer: we use the mean of helpfulness and correctness to rank samples. For each prompt, we take all 6 possible comparison pairs and then delete the pairs with equal scores, leading to 21576 comparisons.
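
The pairing rule described for UltraFeedback and HelpSteer (all pairs per prompt, ties dropped) can be sketched roughly as follows. This is a minimal illustration rather than the actual preprocessing script: the function name `build_pairs`, the `(response, score)` input layout, and the output keys are assumptions made for the example.

```python
from itertools import combinations

def build_pairs(prompt, scored_responses):
    """Return chosen/rejected pairs for one prompt, skipping ties.

    scored_responses: list of (response_text, score) tuples (hypothetical layout).
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if score_a == score_b:
            continue  # equal scores express no preference, so the pair is dropped
        chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# Four responses give at most C(4, 2) = 6 pairs; one tie ("b" vs "c") is dropped here.
example = build_pairs("q", [("a", 4.0), ("b", 3.0), ("c", 3.0), ("d", 1.0)])
print(len(example))  # 5
```

For HelpSteer, the score passed in would be the mean of the helpfulness and correctness ratings, as described above.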
### Training
We train the model for one epoch with a learning rate of 5e-6, a batch size of 512, and cosine learning rate decay with a warmup ratio of 0.03.
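
To make the schedule concrete, here is a small sketch of a cosine schedule with warmup. It assumes linear warmup from zero and decay to zero, which is a common shape but is not confirmed by this card; the helper `lr_at_step` and the step count are hypothetical.

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps, not the real value
for s in (0, 15, 30, 515, 1000):
    print(s, lr_at_step(s, total))
```

With one epoch and a batch size of 512, the real number of optimizer steps would be roughly the mixture size divided by 512.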
## Uses
```python
import torch
from transformers import AutoTokenizer, pipeline

rm_tokenizer = AutoTokenizer.from_pretrained("weqweasdas/RM-Mistral-7B")
device = 0  # accelerator.device
rm_pipe = pipeline(
    "sentiment-analysis",
    model="weqweasdas/RM-Mistral-7B",
    # device="auto",
    device=device,
    tokenizer=rm_tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

# "function_to_apply": "none" returns the raw model output as the reward.
pipe_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 1,
}

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# Render the conversation with the chat template and strip the BOS token
# so that it is not duplicated when the pipeline tokenizes the text.
test_texts = [
    rm_tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=False
    ).replace(rm_tokenizer.bos_token, "")
]
pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
rewards = [output[0]["score"] for output in pipe_outputs]
```
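
Since this repo was part of iterative rejection sampling fine-tuning (see the Reference section), one way the resulting `rewards` might be used is best-of-n selection. The sketch below is only an illustration: `pick_best` is a hypothetical helper and the reward values are made up rather than produced by the model.

```python
def pick_best(candidates, rewards):
    """Return the candidate with the highest reward (first one on ties)."""
    if len(candidates) != len(rewards):
        raise ValueError("candidates and rewards must have the same length")
    best_index = max(range(len(rewards)), key=rewards.__getitem__)
    return candidates[best_index]

# Made-up reward values for illustration; real values would come from rm_pipe.
candidates = ["response A", "response B", "response C"]
example_rewards = [0.12, 1.87, -0.45]
print(pick_best(candidates, example_rewards))  # response B
```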
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
## Results
The reward model ranks 2nd on the [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) leaderboard.
## Reference
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
The repo was part of the iterative rejection sampling fine-tuning and iterative DPO work. If you find the content of this repo useful in your work, please consider citing it as follows:
```
@article{dong2023raft,
title={Raft: Reward ranked finetuning for generative foundation model alignment},
author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
journal={arXiv preprint arXiv:2304.06767},
year={2023}
}
@misc{xiong2024iterative,
title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
year={2024},
eprint={2312.11456},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```