yi-01-ai committed on
Commit
a6d67fd
·
1 Parent(s): 9df03e2

Auto Sync from git://github.com/01-ai/Yi.git/commit/2d78f1a452b28ee559a1b5c7deac8b68a315f074

Files changed (1)
  1. README.md +90 -8
README.md CHANGED
@@ -68,10 +68,15 @@ developers at [01.AI](https://01.ai/).
68
  ## News
69
 
70
  <details open>
 
 
 
 
 
71
  <summary>🔔 <b>2023/11/15</b>: The commercial licensing agreement for the Yi series models <a href="https://huggingface.co/01-ai/Yi-34B/discussions/28#65546af9198da1df586baaf2">is set to be updated</a>.</summary>
72
  </details>
73
 
74
- <details open>
75
  <summary>🔥 <b>2023/11/08</b>: Invited test of Yi-34B chat model.</summary>
76
 
77
  Application form:
@@ -100,6 +105,7 @@ sequence length and can be extended to 32K during inference time.
100
 
101
  ## Model Performance
102
 
 
103
 
104
  | Model | MMLU | CMMLU | C-Eval | GAOKAO | BBH | Common-sense Reasoning | Reading Comprehension | Math & Code |
105
  | :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: |
@@ -140,6 +146,57 @@ is derived by averaging the scores on the remaining tasks. Since the scores for
140
  these two tasks are generally lower than the average, we believe that
141
  Falcon-180B's performance was not underestimated.
142
 
 
143
  ## Usage
144
 
145
  Feel free to [create an issue](https://github.com/01-ai/Yi/issues/new) if you
@@ -181,7 +238,36 @@ can also download them manually from the following places:
181
 
182
  ### 3. Examples
183
 
184
- #### 3.1 Use the base model
 
185
 
186
  ```bash
187
  python demo/text_generation.py
@@ -238,7 +324,7 @@ The Arctic is a place of great beauty. The ice and snow are a
238
  For more advanced usage, please refer to the
239
  [doc](https://github.com/01-ai/Yi/tree/main/demo).
240
 
241
- #### 3.2 Finetuning from the base model:
242
 
243
  ```bash
244
  bash finetune/scripts/run_sft_Yi_6b.sh
@@ -253,7 +339,7 @@ bash finetune/scripts/run_eval.sh
253
  For more advanced usage like fine-tuning based on your custom data, please refer to
254
  the [doc](https://github.com/01-ai/Yi/tree/main/finetune).
255
 
256
- #### 3.3 Quantization
257
 
258
  ##### GPT-Q
259
  ```bash
@@ -306,10 +392,6 @@ the Yi series models.
306
 
307
  ## FAQ
308
 
309
- 1. **Will you release the chat version?**
310
-
311
- Yes, the chat version will be released around the end of November 2023.
312
-
313
  1. **What dataset was this trained with?**
314
 
315
  The dataset we use contains Chinese & English only. We used approximately 3T
 
68
  ## News
69
 
70
  <details open>
71
+ <summary>🎯 <b>2023/11/23</b>: Release of the chat models <code>Yi-6B-Chat</code>, <code>Yi-34B-Chat</code>, <code>Yi-6B-Chat-8bits</code>, <code>Yi-34B-Chat-8bits</code>, <code>Yi-6B-Chat-4bits</code>, and <code>Yi-34B-Chat-4bits</code>.</summary>
72
+ This release contains two chat models based on the previously released base models, two 8-bit models quantized by GPTQ, and two 4-bit models quantized by AWQ.
73
+ </details>
74
+
75
+ <details>
76
  <summary>🔔 <b>2023/11/15</b>: The commercial licensing agreement for the Yi series models <a href="https://huggingface.co/01-ai/Yi-34B/discussions/28#65546af9198da1df586baaf2">is set to be updated</a>.</summary>
77
  </details>
78
 
79
+ <details>
80
  <summary>🔥 <b>2023/11/08</b>: Invited test of Yi-34B chat model.</summary>
81
 
82
  Application form:
 
105
 
106
  ## Model Performance
107
 
108
+ ### Base Model Performance
109
 
110
  | Model | MMLU | CMMLU | C-Eval | GAOKAO | BBH | Common-sense Reasoning | Reading Comprehension | Math & Code |
111
  | :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: |
 
146
  these two tasks are generally lower than the average, we believe that
147
  Falcon-180B's performance was not underestimated.
148
 
149
+ ### Chat Model Performance
150
+
151
+ | Model | MMLU | MMLU | CMMLU | CMMLU | C-Eval(val)<sup>*</sup> | C-Eval(val)<sup>*</sup> | Truthful QA | BBH | BBH | GSM8k | GSM8k |
152
+ | ----------------------- | --------- | --------- | --------- | --------- | ----------------------- | ----------------------- | ----------- | --------- | --------- | --------- | --------- |
153
+ | | 0-shot | 5-shot | 0-shot | 5-shot | 0-shot | 5-shot | 0-shot | 0-shot | 3-shot | 0-shot | 4-shot |
154
+ | LLaMA2-13B-Chat | 50.88 | 47.33 | 27.47 | 35.08 | 27.93 | 35.88 | 36.84 | 32.90 | 58.22 | 36.85 | 2.73 |
155
+ | LLaMA2-70B-Chat | 59.42 | 59.86 | 36.10 | 40.99 | 34.99 | 41.31 | 53.95 | 42.36 | 58.53 | 47.08 | 58.68 |
156
+ | Baichuan2-13B-Chat | 55.09 | 50.14 | 58.64 | 59.47 | 56.02 | 54.75 | 48.98 | 38.81 | 47.15 | 45.72 | 23.28 |
157
+ | Qwen-14B-Chat | 63.99 | 64.98 | 67.73 | 70.57 | 66.12 | 70.06 | 52.49 | 49.65 | 54.98 | 59.51 | 61.18 |
158
+ | InternLM-Chat-20B | 55.55 | 57.42 | 53.55 | 53.75 | 51.19 | 53.57 | 51.75 | 42.41 | 36.68 | 15.69 | 43.44 |
159
+ | AquilaChat2-34B v1.2 | 65.15 | 66.70 | 67.51 | 70.02 | **82.99** | **89.38** | **64.33** | 20.12 | 34.28 | 11.52 | 48.45 |
160
+ | Yi-6B-Chat | 58.24 | 60.99 | 69.44 | 74.71 | 68.80 | 74.22 | 50.58 | 39.70 | 47.15 | 38.44 | 44.88 |
161
+ | Yi-6B-Chat-8bits(GPTQ) | 58.29 | 60.96 | 69.21 | 74.69 | 69.17 | 73.85 | 49.85 | 40.35 | 47.26 | 39.42 | 44.88 |
162
+ | Yi-6B-Chat-4bits(AWQ) | 56.78 | 59.89 | 67.70 | 73.29 | 67.53 | 72.29 | 50.29 | 37.74 | 43.62 | 35.71 | 38.36 |
163
+ | Yi-34B-Chat | **67.62** | 73.46 | **79.11** | **81.34** | 77.04 | 78.53 | 62.43 | 51.41 | **71.74** | **71.65** | **75.97** |
164
+ | Yi-34B-Chat-8bits(GPTQ) | 66.24 | **73.69** | 79.05 | 81.23 | 76.82 | 78.97 | 61.84 | **52.08** | 70.97 | 70.74 | 75.74 |
165
+ | Yi-34B-Chat-4bits(AWQ) | 65.77 | 72.42 | 78.21 | 80.50 | 75.71 | 77.27 | 61.84 | 48.30 | 69.39 | 70.51 | 74.00 |
166
+
167
+ We evaluated the benchmarks with both zero-shot and few-shot prompting, except for TruthfulQA, which was evaluated zero-shot only. Zero-shot evaluation is generally the more common setting for chat models. Our evaluation strategy is to have the model generate a response while following instructions given explicitly or implicitly (for example, through few-shot examples), and then to extract the relevant answer from the generated text. Some models are not well suited to producing output in the specific format that the instructions require on a few datasets, which leads to suboptimal results.
168
+
169
+ <strong>*</strong>: C-Eval results are evaluated on the validation datasets.
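
The answer-extraction step described above can be illustrated with a minimal sketch. The README does not state the actual extraction rules, so the function name `extract_choice`, the two regular expressions, and the restriction to choices A through D are all assumptions made for illustration only.

```python
import re

# Hypothetical patterns; the real evaluation scripts may use different rules.
ANSWER_PATTERNS = [
    # Explicit statement such as "The answer is (B)" or "Answer: C".
    re.compile(r"(?i:answer\s*(?:is|:))\s*\(?([A-D])\b"),
    # Fallback heuristic: a line that starts with a bare choice letter, e.g. "C. Paris".
    re.compile(r"^\s*\(?([A-D])\b", re.MULTILINE),
]


def extract_choice(generated_text):
    """Return the first multiple-choice letter found in the text, or None."""
    for pattern in ANSWER_PATTERNS:
        match = pattern.search(generated_text)
        if match:
            return match.group(1)
    return None
```

A response that matches neither pattern yields `None`, which is one way a model that ignores the required output format ends up with a lower score.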
170
+
171
+ ### Quantized Chat Model Performance
172
+
173
+ We also provide 4-bit (AWQ) and 8-bit (GPTQ) quantized Yi chat models. Evaluation results on the benchmarks above show that the quantized models lose very little accuracy while reducing the memory footprint. After testing different configurations of prompts and generation lengths, we highly recommend following the memory footprint table below when selecting a device to run our models.
174
+
175
+ | | batch=1 | batch=4 | batch=16 | batch=32 |
176
+ | ----------------------- | ------- | ------- | -------- | -------- |
177
+ | Yi-34B-Chat | 65GiB | 68GiB | 76GiB | >80GiB |
178
+ | Yi-34B-Chat-8bits(GPTQ) | 35GiB | 37GiB | 46GiB | 58GiB |
179
+ | Yi-34B-Chat-4bits(AWQ) | 19GiB | 20GiB | 30GiB | 40GiB |
180
+ | Yi-6B-Chat | 12GiB | 13GiB | 15GiB | 18GiB |
181
+ | Yi-6B-Chat-8bits(GPTQ) | 7GiB | 8GiB | 10GiB | 14GiB |
182
+ | Yi-6B-Chat-4bits(AWQ) | 4GiB | 5GiB | 7GiB | 10GiB |
183
+
184
+ Note: All the numbers in the table represent the minimum recommended memory for running models of the corresponding size.
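
As a rough aid for reading the table above, the sketch below encodes its values and lists the models whose minimum recommended memory fits a given device. The names `MEMORY_GIB` and `models_that_fit` are hypothetical, and the `>80GiB` cell is stored as `None` because the table gives no exact figure for it.

```python
# Minimum recommended memory in GiB per batch size, copied from the table above.
# None marks the ">80GiB" cell, which has no exact value.
MEMORY_GIB = {
    "Yi-34B-Chat": {1: 65, 4: 68, 16: 76, 32: None},
    "Yi-34B-Chat-8bits(GPTQ)": {1: 35, 4: 37, 16: 46, 32: 58},
    "Yi-34B-Chat-4bits(AWQ)": {1: 19, 4: 20, 16: 30, 32: 40},
    "Yi-6B-Chat": {1: 12, 4: 13, 16: 15, 32: 18},
    "Yi-6B-Chat-8bits(GPTQ)": {1: 7, 4: 8, 16: 10, 32: 14},
    "Yi-6B-Chat-4bits(AWQ)": {1: 4, 5: None, 4: 5, 16: 7, 32: 10},
}
# Drop the placeholder key so every row has exactly the four table columns.
del MEMORY_GIB["Yi-6B-Chat-4bits(AWQ)"][5]


def models_that_fit(available_gib, batch_size):
    """List the models whose recommended memory is within available_gib."""
    fitting = []
    for name, by_batch in MEMORY_GIB.items():
        if batch_size not in by_batch:
            raise ValueError(f"batch size {batch_size} is not in the table")
        required = by_batch[batch_size]
        # Skip cells without an exact number (the ">80GiB" entry).
        if required is not None and required <= available_gib:
            fitting.append(name)
    return fitting
```

For example, `models_that_fit(24, 1)` returns the 4-bit 34B model and all three 6B variants, since those are the rows at or below 24GiB in the `batch=1` column.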
185
+
186
+ ### Limitations of Chat Model
187
+
188
+ The released chat models were trained exclusively with Supervised Fine-Tuning (SFT). Compared to other standard chat models, they produce more diverse responses, which makes them suitable for various downstream tasks, such as creative scenarios. This diversity is also expected to increase the likelihood of generating higher-quality responses, which will be advantageous for subsequent Reinforcement Learning (RL) training.
189
+
190
+ However, this higher diversity might amplify certain existing issues, including:
191
+
192
+ - **Hallucination**: The model generates factually incorrect or nonsensical information. Because the model's responses are more varied, there is a higher chance of hallucinations that are not based on accurate data or logical reasoning.
193
+ - **Non-determinism in re-generation**: When attempting to regenerate or sample responses, inconsistencies in the outcomes may occur. The increased diversity can lead to varying results even under similar input conditions.
194
+ - **Cumulative Error**: This occurs when errors in the model's responses compound over time. As the model generates more diverse responses, the likelihood of small inaccuracies building up into larger errors increases, especially in complex tasks like extended reasoning, mathematical problem-solving, etc.
195
+
196
+ To achieve more coherent and consistent responses, it is advisable to adjust generation configuration parameters such as `temperature`, `top_p`, or `top_k`. These adjustments can help balance creativity and coherence in the model's outputs.
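
To make the effect of these three parameters concrete, here is a minimal pure-Python sketch of how temperature scaling, top-k, and top-p (nucleus) filtering reshape a next-token distribution. It is an illustration of the general technique over toy logits, not the implementation used by any particular library, and the function name `sampling_distribution` is hypothetical.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least likely; top_k=0 disables top-k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest ranked prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the surviving tokens.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(sampling_distribution(toy_logits))
print(sampling_distribution(toy_logits, temperature=0.5, top_k=2))
```

Lowering `temperature` or tightening `top_k` and `top_p` concentrates probability on the most likely tokens, which trades diversity for consistency.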
197
+
198
+
199
+
200
  ## Usage
201
 
202
  Feel free to [create an issue](https://github.com/01-ai/Yi/issues/new) if you
 
238
 
239
  ### 3. Examples
240
 
241
+ #### 3.1 Use the chat model
242
+
243
+ ```python
244
+ from transformers import AutoModelForCausalLM, AutoTokenizer
245
+
246
+ model_path = '01-ai/Yi-34B-Chat'
247
+
248
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
249
+
250
+ # Since transformers 4.35.0, the GPT-Q/AWQ model can be loaded using AutoModelForCausalLM.
251
+ model = AutoModelForCausalLM.from_pretrained(
252
+     model_path,
253
+     device_map="auto",
254
+     torch_dtype='auto'
255
+ ).eval()
256
+
257
+ # Prompt content: "hi"
258
+ messages = [
259
+     {"role": "user", "content": "hi"}
260
+ ]
261
+
262
+ input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt')
263
+ output_ids = model.generate(input_ids.to('cuda'))
264
+ response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
265
+
266
+ # Model response: "Hello! How can I assist you today?"
267
+ print(response)
268
+ ```
269
+
270
+ #### 3.2 Use the base model
271
 
272
  ```bash
273
  python demo/text_generation.py
 
324
  For more advanced usage, please refer to the
325
  [doc](https://github.com/01-ai/Yi/tree/main/demo).
326
 
327
+ #### 3.3 Fine-tuning from the base model
328
 
329
  ```bash
330
  bash finetune/scripts/run_sft_Yi_6b.sh
 
339
  For more advanced usage like fine-tuning based on your custom data, please refer to
340
  the [doc](https://github.com/01-ai/Yi/tree/main/finetune).
341
 
342
+ #### 3.4 Quantization
343
 
344
  ##### GPT-Q
345
  ```bash
 
392
 
393
  ## FAQ
394
 
 
 
 
 
395
  1. **What dataset was this trained with?**
396
 
397
  The dataset we use contains Chinese & English only. We used approximately 3T