weqweasdas committed
Commit e8adab3 · Parent: 102b5ac
Update README.md
README.md CHANGED
@@ -86,7 +86,7 @@ The reward model ranks 2nd in the [RewardBench](https://huggingface.co/spaces/al
 
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
-
+The repo was part of the iterative rejection sampling fine-tuning and iterative DPO. If you find the content of this repo useful in your work, please consider citing it as follows:
 
 
 ```
@@ -96,6 +96,15 @@ To be added. The reward model may be readily used for rejection sampling finetun
 journal={arXiv preprint arXiv:2304.06767},
 year={2023}
 }
+
+@misc{xiong2024iterative,
+      title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
+      author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
+      year={2024},
+      eprint={2312.11456},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
 ```
 
 
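The README context in the second hunk notes that the reward model may be readily used for rejection sampling fine-tuning. A minimal best-of-N sketch of that idea follows; it is an illustration only, not this repo's documented usage. The model id, the plain prompt-plus-response formatting, and the `reward` helper are assumptions, and the reward model is assumed to load as a standard `transformers` sequence-classification model with a scalar head.

```python
# Best-of-N (rejection sampling) sketch: score candidate responses with a reward
# model and keep the highest-scoring one for fine-tuning data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_id = "your-org/your-reward-model"  # placeholder, not this repo's actual model id
tokenizer = AutoTokenizer.from_pretrained(rm_id)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_id)  # assumes a scalar reward head
reward_model.eval()

def reward(prompt: str, response: str) -> float:
    # Score one (prompt, response) pair; higher is better. Plain concatenation is an
    # assumption; a specific reward model may require its chat template instead.
    inputs = tokenizer(prompt + "\n" + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

prompt = "Explain rejection sampling fine-tuning in one sentence."
candidates = [
    "It keeps only the highest-reward responses sampled from the policy and fine-tunes on them.",
    "It is a way to compress model weights.",
]
best = max(candidates, key=lambda r: reward(prompt, r))
print(best)
```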