RangiLyu committed
Commit 154ff6b
Parent: 63e9f7f

model rename

README.md CHANGED
@@ -14,7 +14,7 @@ tags:
  <img src="https://github.com/InternLM/InternLM/assets/22529082/b9788105-8892-4398-8b47-b513a292378e" width="200"/>
  <div>&nbsp;</div>
  <div align="center">
- <b><font size="5">InternLM Reward</font></b>
+ <b><font size="5">InternLM2-1.8B-Reward</font></b>
  </div>


@@ -29,22 +29,22 @@ tags:

  ## Introduction

- **InternLM-Reward** is a reward model trained on the foundation of InternLM2-Chat-SFT. This model has been trained using over 2.4 million preference samples, both human-annotated and AI-synthesized, achieving outstanding performance while ensuring a balance between helpful and harmless.
+ **InternLM2-1.8B-Reward** is a reward model trained on the foundation of InternLM2-Chat-1.8B-SFT. This model has been trained using over 2.4 million preference samples, both human-annotated and AI-synthesized, achieving outstanding performance while ensuring a balance between helpful and harmless.

  ### Key Features:
- - **Variety of Sizes Available**: Our open-sourced reward models are available in sizes of **1.8B, 7B, and 20B**, each demonstrating exceptional performance across various metrics.
+ - **Variety of Sizes Available**: Our open-sourced reward models are available in sizes of **1.8B, 7B, and 20B**, each demonstrating exceptional performance across various metrics. We aim for these different-sized models to facilitate research on the scaling laws of reward models, providing valuable insights to the community.
  - **Comprehensive Coverage of Preference**: Trained with **2.4 million** preference pairs derived from both human annotations and AI synthesis, covering diverse areas such as dialogue, writing, poetry, summarization, coding, mathematics, etc. It also maintains a balance between helpful and harmless.
- - **Multilingual Support**: InternLM-Reward was trained on high-quality **English and Chinese** preference data, delivering robust performance in both languages.
+ - **Multilingual Support**: InternLM2-Reward was trained on high-quality **English and Chinese** preference data, delivering robust performance in both languages.

- This model was applied to the PPO training process of InternLM2-Chat. The reward model training techniques from the [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297) have been open-sourced in XTuner, try it out [here](https://github.com/InternLM/xtuner)!
+ This model was applied to the RLHF training process of InternLM2-Chat. The reward model training techniques from the [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297) have been open-sourced in XTuner, try it out [here](https://github.com/InternLM/xtuner)!

  ## Performance Evaluation on RewardBench

  | Models | Score | Chat | Chat Hard | Safety | Reasoning |
  | --- | --- | --- | --- | --- | --- |
- | InternLM-Reward-20B | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
- | InternLM-Reward-7B | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
- | InternLM-Reward-1.8B | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |
+ | InternLM2-20B-Reward | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
+ | InternLM2-7B-Reward | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
+ | InternLM2-1.8B-Reward | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |

  - The evaluation is conducted on the [RewardBench](https://github.com/allenai/reward-bench) dataset.
  - For a fair comparison, conditional system prompts proposed in our technical report were not included during testing.
@@ -60,12 +60,12 @@ import torch
  from transformers import AutoModel, AutoTokenizer

  model = AutoModel.from_pretrained(
- "internlm/internlm-reward-7b",
+ "internlm/internlm2-1_8b-reward",
  device_map="cuda",
  torch_dtype=torch.float16,
  trust_remote_code=True,
  )
- tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-reward-7b", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

  chat_1 = [
  {"role": "user", "content": "Hello! What's your name?"},
@@ -125,12 +125,12 @@ llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trus

  # prepare the reward model and tokenizer
  reward = AutoModel.from_pretrained(
- "internlm/internlm-reward-7b",
+ "internlm/internlm2-1_8b-reward",
  device_map="cuda",
  torch_dtype=torch.float16,
  trust_remote_code=True,
  )
- reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-reward-7b", trust_remote_code=True)
+ reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

  # prepare the chat prompt
  prompt = "Write an article about the artificial intelligence revolution."
@@ -191,12 +191,12 @@ The code is licensed under Apache-2.0, while model weights are fully open for ac
  ```
  ## 简介

- **InternLM-Reward** 是基于 **InternLM2-Chat-SFT** 训练的奖励模型。该模型使用超过 240 万条人工标注和 AI 合成的偏好样本,覆盖了包括对话、写作、诗歌、总结、编码和数学等多个领域。在取得了出色性能的同时也兼顾了实用性和安全性偏好的平衡。
+ **InternLM2-1.8B-Reward** 是基于 **InternLM2-Chat-1.8B-SFT** 训练的奖励模型。该模型使用超过 240 万条人工标注和 AI 合成的偏好样本,覆盖了包括对话、写作、诗歌、总结、编码和数学等多个领域。在取得了出色性能的同时也兼顾了实用性和安全性偏好的平衡。

- ### InternLM-Reward 的主要特点:
- - **多种尺寸可供选择**:我们开源的奖励模型有 1.8B、7B 和 20B 三种尺寸,每种尺寸都展示出了卓越的性能。
- - **全面覆盖偏好**:模型训练了 240 万条来自人工标注和AI合成的偏好样本,涉及对话、写作、诗歌、总结、编码和数学等多个领域,同时确保了实用性和安全性偏好的平衡。
- - **多语言支持**:InternLM-Reward 在高质量的**英文和中文**偏好数据上进行训练,确保了在这两种语言上都有稳健的表现。
+ ### InternLM2-Reward 的主要特点:
+ - **多种尺寸可供选择**:我们开源的奖励模型有 **1.8B、7B 和 20B** 三种尺寸,每种尺寸都展示出了卓越的性能。我们希望这些不同大小的模型能够促进社区关于 Reward Model 缩放定律的研究。
+ - **全面覆盖偏好**:模型训练了 **240 万**条来自人工标注和AI合成的偏好样本,涉及对话、写作、诗歌、总结、编码和数学等多个领域,同时确保了实用性和安全性偏好的平衡。
+ - **多语言支持**:InternLM2-Reward 在高质量的**英文和中文**偏好数据上进行训练,确保了在这两种语言上都有稳健的表现。

  该模型运用在了 InternLM2-Chat 的 PPO 训练过程中。我们的[技术报告](https://arxiv.org/abs/2403.17297)中提出的 Reward Model 训练技巧已在 XTuner 中公开。欢迎点击[链接](https://github.com/InternLM/xtuner)进行尝试!

@@ -204,9 +204,9 @@ The code is licensed under Apache-2.0, while model weights are fully open for ac

  | Models | Score | Chat | Chat Hard | Safety | Reasoning |
  | --- | --- | --- | --- | --- | --- |
- | InternLM-Reward-20B | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
- | InternLM-Reward-7B | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
- | InternLM-Reward-1.8B | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |
+ | InternLM2-20B-Reward | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
+ | InternLM2-7B-Reward | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
+ | InternLM2-1.8B-Reward | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |

  - 评估使用了 [RewardBench](https://github.com/allenai/reward-bench) 数据集进行。
  - 为了公平比较,测试期间没有使用我们技术报告中提出的"条件系统提示"。
@@ -215,19 +215,19 @@ The code is licensed under Apache-2.0, while model weights are fully open for ac

  ### 基本用法

- 我们为您提供了一些用户友好的 API 以便使用该模型。以下是一些示例,展示如何使用 InternLM-Reward 获取聊天的奖励分数、比较两组对话或对多个对话进行排名。
+ 我们为您提供了一些用户友好的 API 以便使用该模型。以下是一些示例,展示如何使用 InternLM2-Reward 获取对话的奖励分数、比较两组对话或对多个对话进行排名。

  ```python
  import torch
  from transformers import AutoModel, AutoTokenizer

  model = AutoModel.from_pretrained(
- "internlm/internlm-reward-7b",
+ "internlm/internlm2-1_8b-reward",
  device_map="cuda",
  torch_dtype=torch.float16,
  trust_remote_code=True,
  )
- tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-reward-7b", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

  chat_1 = [
  {"role": "user", "content": "Hello! What's your name?"},
@@ -269,7 +269,7 @@ print("rank_res: ", rank_res) # 排名序号越低表示分数越高

  ### Best of N 采样

- 以下是如何使用 InternLM-Reward 执行Best of N 采样的示例。
+ 以下是如何使用 InternLM2-Reward 执行Best of N 采样的示例。
  以下代码演示了如何从语言模型生成的候选回答中选择最佳回答。

  ```python
@@ -287,12 +287,12 @@ llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trus

  # 准备奖励模型和分词器
  reward = AutoModel.from_pretrained(
- "internlm/internlm-reward-7b",
+ "internlm/internlm2-1_8b-reward",
  device_map="cuda",
  torch_dtype=torch.float16,
  trust_remote_code=True,
  )
- reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-reward-7b", trust_remote_code=True)
+ reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

  # 准备提示词
  prompt = "Write an article about the artificial intelligence revolution."
reward_bench_results/eval-set/{internlm-reward-1_8b.json → internlm2-1_8b-reward.json} RENAMED
@@ -16,7 +16,7 @@
  "llmbar-adver-neighbor": 0.4626865671641791,
  "llmbar-natural": 0.88,
  "math-prm": 0.930648769574944,
- "model": "internlm/internlm-reward-1_8b",
+ "model": "internlm/internlm2-1_8b-reward",
  "model_type": "Seq. Classifier",
  "mt-bench-easy": 0.9642857142857143,
  "mt-bench-hard": 0.7297297297297297,
reward_bench_results/pref-sets/internlm-reward-1_8b.json DELETED
File without changes
reward_bench_results/pref-sets/internlm2-1_8b-reward.json ADDED
@@ -0,0 +1,12 @@
+ {
+ "anthropic_harmless": 0.6932921447484555,
+ "anthropic_helpful": 0.6871770025839793,
+ "anthropic_hhh": 0.8190045248868778,
+ "chat_template": "tokenizer",
+ "model": "internlm/internlm2-1_8b-reward",
+ "model_type": "Seq. Classifier",
+ "mtbench_gpt4": 0.9058333333333334,
+ "mtbench_human": 0.7415797317436662,
+ "shp": 0.630097645031591,
+ "summarize": 0.6773333333333333
+ }
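The rename touches not only the result filenames but also the `"model"` field recorded inside each JSON. A small sketch, assuming the paths shown in this commit, that checks the two stay consistent:

```python
# Sketch: verify that the renamed RewardBench result files record the new model id.
# Paths are taken from this commit; the expected id is an assumption based on the rename.
import json
from pathlib import Path

EXPECTED_MODEL = "internlm/internlm2-1_8b-reward"
RESULT_FILES = [
    Path("reward_bench_results/eval-set/internlm2-1_8b-reward.json"),
    Path("reward_bench_results/pref-sets/internlm2-1_8b-reward.json"),
]

for path in RESULT_FILES:
    data = json.loads(path.read_text())
    recorded = data.get("model")
    status = "OK" if recorded == EXPECTED_MODEL else "MISMATCH"
    print(f"{path}: model={recorded!r} [{status}]")
```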