Muennighoff committed 322cbbc (1 parent: 8a7ab6f)

Update README.md

Files changed (1): README.md (+304, -63)
README.md CHANGED
@@ -1,4 +1,16 @@
  ---
  language:
  - zh
  - en
@@ -6,97 +18,326 @@ tags:
  - codegeex
  - glm
  - chatglm
- - thudm
  ---

- ![](resources/codegeex_logo.png)

- <p align="center">
- 🏠 <a href="https://codegeex.cn" target="_blank">Homepage</a> | 💻 <a href="https://github.com/THUDM/CodeGeeX2" target="_blank">GitHub</a> | 🛠 Tools: <a href="https://marketplace.visualstudio.com/items?itemName=aminer.codegeex" target="_blank">VS Code</a>, <a href="https://plugins.jetbrains.com/plugin/20587-codegeex" target="_blank">JetBrains</a> | 🤗 <a href="https://huggingface.co/THUDM/codegeex2-6b" target="_blank">HF Repo</a> | 📄 <a href="https://arxiv.org/abs/2303.17568" target="_blank">Paper</a>
- </p>

- <p align="center">
- 👋 Join our <a href="https://discord.gg/8gjHdkmAN6" target="_blank">Discord</a>, <a href="https://join.slack.com/t/codegeexworkspace/shared_invite/zt-1s118ffrp-mpKKhQD0tKBmzNZVCyEZLw" target="_blank">Slack</a>, <a href="https://t.me/+IipIayJ32B1jOTg1" target="_blank">Telegram</a>, or <a href="https://github.com/THUDM/CodeGeeX2/blob/main/resources/wechat.md" target="_blank">WeChat</a>
- </p>

- INT4 quantized version: [codegeex2-6b-int4](https://huggingface.co/THUDM/codegeex2-6b-int4)

- # CodeGeeX2: A More Powerful Multilingual Code Generation Model

- CodeGeeX2 is the second generation of the multilingual code generation model [CodeGeeX](https://github.com/THUDM/CodeGeeX) ([KDD'23](https://arxiv.org/abs/2303.17568)). It is implemented on the [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B) architecture with additional code pre-training. Thanks to the improvements in ChatGLM2, CodeGeeX2's coding capability improves comprehensively (+107% over CodeGeeX; with only 6B parameters, it surpasses the 15B-parameter StarCoder-15B on some tasks by nearly 10%). Its features include:

- * **More powerful coding capabilities**: Built on the ChatGLM2-6B base model, CodeGeeX2-6B is further pre-trained on 600B code tokens, improving on the first generation across the board. On the [HumanEval-X](https://huggingface.co/datasets/THUDM/humaneval-x) benchmark, all six languages improve significantly (Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, Rust +321%), and it reaches a 35.9% Pass@1 rate on Python, surpassing the larger StarCoder-15B.
- * **More useful model features**: Inheriting the ChatGLM2-6B feature set, CodeGeeX2-6B supports both Chinese and English prompts and a maximum sequence length of 8192. Inference is significantly faster than the first-generation CodeGeeX-13B, and after quantization the model runs in only 6GB of GPU memory, enabling lightweight local deployment (a sketch follows this list).
- * **A more comprehensive AI coding assistant**: The backend of the CodeGeeX plugin ([VS Code](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex), [JetBrains](https://plugins.jetbrains.com/plugin/20587-codegeex)) has been upgraded to support 100+ programming languages, adding practical features such as infilling and cross-file completion. Combined with the "Ask CodeGeeX" interactive AI coding assistant, it can solve a wide range of programming problems via Chinese or English dialogue, including but not limited to code summarization, code translation, debugging, and comment generation, helping developers work more efficiently.
- * **A more open license**: CodeGeeX2-6B weights are fully open for academic research; to apply for commercial use, please fill in the [registration form](https://open.bigmodel.cn/mla/form?mcode=CodeGeeX2-6B).
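
- A minimal sketch of the quantized loading mentioned above, assuming the `quantize(4)` helper that CodeGeeX2 inherits from the ChatGLM2 API:

- ```python
- from transformers import AutoModel
-
- # assumption: ChatGLM2-style 4-bit quantization, which brings GPU memory
- # use down to roughly 6GB
- model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
- model = model.quantize(4).cuda().eval()
- ```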
 
 

- ## Dependencies

- ```shell
- pip install protobuf transformers==4.30.2 cpm_kernels torch>=2.0 gradio mdtex2html sentencepiece accelerate
- ```

- ## Get Started

  ```python
- from transformers import AutoTokenizer, AutoModel
- tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
- model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device='cuda')
- model = model.eval()
-
- # remember to add a language tag for better performance
- prompt = "# language: python\n# write a bubble sort function\n"
- inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
- outputs = model.generate(inputs, max_length=256, top_k=1)
- response = tokenizer.decode(outputs[0])
-
- >>> print(response)
- # language: python
- # write a bubble sort function
-
-
- def bubble_sort(list):
-     for i in range(len(list) - 1):
-         for j in range(len(list) - 1):
-             if list[j] > list[j + 1]:
-                 list[j], list[j + 1] = list[j + 1], list[j]
-     return list
-
-
- print(bubble_sort([5, 2, 4, 6, 1, 3]))
- ```

- For more information, please refer to CodeGeeX2's [GitHub repo](https://github.com/THUDM/CodeGeeX2).

- ## License

- The code in this repository is open source under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license. The model weights are licensed under the [Model License](MODEL_LICENSE).

- ## Citation

- If you find our work helpful, please feel free to cite the following paper:

- ```
- @inproceedings{zheng2023codegeex,
-   title={CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X},
-   author={Qinkai Zheng and Xiao Xia and Xu Zou and Yuxiao Dong and Shan Wang and Yufei Xue and Zihan Wang and Lei Shen and Andi Wang and Yang Li and Teng Su and Zhilin Yang and Jie Tang},
-   booktitle={KDD},
-   year={2023}
- }
- ```

  ---
+ pipeline_tag: text-generation
+ inference: true
+ widget:
+ - text: 'def print_hello_world():'
+   example_title: Hello world
+   group: Python
+ datasets:
+ - bigcode/commitpackft
+ - bigcode/oasst-octopack
+ metrics:
+ - code_eval
+ library_name: transformers
  language:
  - zh
  - en
  tags:
  - codegeex
  - glm
  - chatglm
+ model-index:
+ - name: OctoGeeX
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesize Python
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 46.2
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesize JavaScript
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 39.2
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesize Java
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 38.2
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesize Go
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 30.4
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesize C++
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 35.6
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesize Rust
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 23.4
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesize Average
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 35.5
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix Python
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 30.2
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix JavaScript
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 28.4
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix Java
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 30.6
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix Go
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 30.2
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix C++
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 26.1
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix Rust
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 16.5
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix Average
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 27.0
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain Python
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 35.1
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain JavaScript
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 24.5
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain Java
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 27.3
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain Go
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 21.1
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain C++
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 24.1
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain Rust
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 14.8
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain Average
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 24.5
+       verified: false
  ---

+ ![Octopack](https://github.com/bigcode-project/octopack/blob/31f3320f098703c7910e43492c39366eeea68d83/banner.png?raw=true)

+ # OctoGeeX

+ Play with the model on the [TODO Playground](https://huggingface.co/spaces/bigcode/bigcode-playground).

+ ## Table of Contents

+ 1. [Model Summary](#model-summary)
+ 2. [Use](#use)
+ 3. [Training](#training)
+ 4. [License](#license)
+ 5. [Citation](#citation)

+ ## Model Summary

+ OctoGeeX is an instruction-tuned model with 6B parameters, created by fine-tuning [CodeGeeX2](https://huggingface.co/THUDM/codegeex2-6b) on [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft) & [OASST](https://huggingface.co/datasets/bigcode/oasst-octopack), as described in the OctoPack paper.
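
+ As a rough illustration of the instruction data, the sketch below pairs a CommitPackFT commit message with its code change. The field and config names follow the dataset card; the Question/Answer formatting is an assumption based on the prompt format recommended under Intended use, not the actual OctoPack training pipeline.

+ ```python
+ # a minimal sketch, not the real fine-tuning code
+ from datasets import load_dataset
+
+ # CommitPackFT ships one config per programming language
+ ds = load_dataset("bigcode/commitpackft", "python", split="train")
+ ex = ds[0]
+
+ # the commit subject acts as the instruction; the old and new file
+ # contents give the input and the target of the edit
+ prompt = f"Question: {ex['subject']}\n\n{ex['old_contents']}\n\nAnswer:"
+ target = ex["new_contents"]
+ print(prompt[:200])
+ ```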
 
+ - **Repository:** [bigcode/octopack](https://github.com/bigcode-project/octopack)
+ - **Paper:** [TODO]()
+ - **Languages:** 80+ programming languages
+ - **OctoPack🐙🎒:**
+ <table>
+ <tr>
+ <th>Data</th>
+ <th><a href=https://huggingface.co/datasets/bigcode/commitpack>CommitPack</a></th>
+ <td>4TB of GitHub commits across 350 programming languages</td>
+ </tr>
+ <tr>
+ <th></th>
+ <th><a href=https://huggingface.co/datasets/bigcode/commitpackft>CommitPackFT</a></th>
+ <td>Filtered version of CommitPack for high-quality commit messages that resemble instructions</td>
+ </tr>
+ <tr>
+ <th>Model</th>
+ <th><a href=https://huggingface.co/bigcode/octocoder>OctoCoder</a></th>
+ <td>StarCoder (16B parameters) instruction tuned on CommitPackFT + OASST</td>
+ </tr>
+ <tr>
+ <th></th>
+ <th><a href=https://huggingface.co/bigcode/octogeex>OctoGeeX</a></th>
+ <td>CodeGeeX2 (6B parameters) instruction tuned on CommitPackFT + OASST</td>
+ </tr>
+ <tr>
+ <th>Evaluation&nbsp;&nbsp;</th>
+ <th><a href=https://huggingface.co/datasets/bigcode/humanevalpack>HumanEvalPack</a></th>
+ <td>Extension of OpenAI's HumanEval to cover 3 scenarios across 6 languages</td>
+ </tr>
+ </table>
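
+ For orientation, the evaluation data can be inspected directly. A minimal sketch, assuming the per-language config names and task fields listed on the HumanEvalPack dataset card:

+ ```python
+ from datasets import load_dataset
+
+ # one config per language: python, js, java, go, cpp, rust
+ ds = load_dataset("bigcode/humanevalpack", "python")["test"]
+ task = ds[0]
+
+ # each task includes a prompt and a canonical solution, plus a buggy
+ # variant used for the HumanEvalFix scenario
+ print(task["prompt"])
+ print(task["canonical_solution"])
+ ```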
 
 
+ ## Use

+ ### Intended use

+ The model follows instructions provided in the input. We recommend prefacing your input with "Question: " and finishing with "Answer:", for example: "Question: Please write a function in Python that performs bubble sort.\n\nAnswer:"

+ **Feel free to share your generations in the Community tab!**

+ ### Generation
  ```python
+ # pip install -q transformers
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ checkpoint = "bigcode/octogeex"
+ device = "cuda"  # for GPU usage or "cpu" for CPU usage
+
+ # trust_remote_code=True because the CodeGeeX2/ChatGLM2 base architecture
+ # ships its modeling code with the checkpoint
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+
+ inputs = tokenizer.encode("Question: Please write a function in Python that performs bubble sort.\n\nAnswer:", return_tensors="pt").to(device)
+ outputs = model.generate(inputs)
+ print(tokenizer.decode(outputs[0]))
+ ```
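
+ For longer or more varied completions, the standard 🤗 `generate` sampling arguments apply. A sketch continuing the snippet above (the parameter values are illustrative, not tuned):

+ ```python
+ outputs = model.generate(
+     inputs,
+     max_new_tokens=256,  # leave room for a full function body
+     do_sample=True,      # sample instead of greedy decoding
+     temperature=0.2,
+     top_p=0.95,
+ )
+ print(tokenizer.decode(outputs[0]))
+ ```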
 
+ ## Training
 
+ ### Model

+ - **Architecture:** Transformer decoder based on CodeGeeX2 (ChatGLM2 architecture)
+ - **Steps:** 250k pretraining & 30 instruction tuning
+ - **Tokens:** 1 trillion pretraining & 2M instruction tuning
+ - **Precision:** bfloat16
 
+ ### Hardware

+ - **Pretraining:**
+   - **GPUs:** 512 Tesla A100
+   - **Training time:** 24 days
+ - **Instruction tuning:**
+   - **GPUs:** 8 Tesla A100
+   - **Training time:** 4 hours
 
+ ### Software

+ - **Orchestration:** [Megatron-LM/Transformers](https://github.com/bigcode-project/octopack#training)
+ - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
 
+ ## License

+ The code in this repository is open source under the [MIT license](https://github.com/bigcode-project/octopack/blob/main/LICENSE). The model weights are licensed under the [Model License](MODEL_LICENSE).
 
+ ## Citation

+ TODO