Update README.md
README.md
CHANGED
@@ -1,34 +1,41 @@
|
|
1 |
---
|
2 |
-
|
3 |
---
|
4 |
|
5 |
-
|
6 |
|
7 |
-
CodeShell
|
8 |
|
9 |
-
CodeShell
|
10 |
|
11 |
-
|
12 |
-
* Efficient training: built on an efficient data governance system, trained from a cold start on 500B of high-quality data
|
13 |
-
* Complete stack: the model and IDE plugins are open-sourced as a full-stack technology system
|
14 |
-
* Lightweight and fast: supports local C++ deployment, providing a lightweight local solution
|
15 |
-
* Comprehensive evaluation: provides a multi-task code evaluation system that supports full project context (to be open-sourced soon)
|
16 |
|
17 |
-
|
18 |
|
19 |
-
|
20 |
-
-
|
21 |
-
|
22 |
-
-
|
23 |
-
|
24 |
-
- JetBrains plugin
|
25 |
|
26 |
-
|
27 |
-
## Model Use
|
28 |
|
29 |
### Code Generation
|
30 |
|
31 |
-
CodeShell provides Hugging Face
|
|
|
|
|
32 |
|
33 |
```python
|
34 |
import torch
|
@@ -44,89 +51,71 @@ print(tokenizer.decode(outputs[0]))
|
|
44 |
|
45 |
CodeShell supports Fill-in-the-Middle mode, which better supports the software development process.
|
46 |
|
47 |
-
|
|
|
|
|
48 |
input_text = "<fim_prefix>def print_hello_world():\n <fim_suffix>\n print('Hello world!')<fim_middle>"
|
49 |
inputs = tokenizer(input_text, return_tensors='pt').to('cuda')  # BatchEncoding has .to(), not .cuda()
|
50 |
outputs = model.generate(**inputs)  # unpack input_ids and attention_mask
|
51 |
print(tokenizer.decode(outputs[0]))
|
52 |
```
|
53 |
|
54 |
-
## Model
|
55 |
-
|
56 |
-
CodeShell supports 4-bit/8-bit quantization; after 4-bit quantization it uses about 6 GB of GPU memory.
|
57 |
-
|
58 |
-
```
|
59 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
60 |
-
tokenizer = AutoTokenizer.from_pretrained("codeshell", trust_remote_code=True)
|
61 |
-
model = AutoModelForCausalLM.from_pretrained("codeshell", trust_remote_code=True)
|
62 |
-
model = model.quantize(4).cuda()
|
63 |
-
|
64 |
-
inputs = tokenizer('def print_hello_world():', return_tensors='pt').to('cuda')
|
65 |
-
outputs = model.generate(**inputs)
|
66 |
-
print(tokenizer.decode(outputs[0]))
|
67 |
-
```
|
68 |
-
|
69 |
-
## CodeShell IDE Plugin
|
70 |
-
|
71 |
-
### Web API
|
72 |
-
|
73 |
-
CodeShell provides a Web API deployment tool that gives the IDE plugins API support.
|
74 |
-
|
75 |
-
```
|
76 |
-
git clone [email protected]:WisdomShell/codeshell.git
|
77 |
-
cd codeshell
|
78 |
-
python api.py
|
79 |
-
```
|
80 |
|
81 |
-
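CodeShell provides C/C++ inference support, so it runs efficiently even on personal PCs without a GPU. Developers can compile it for their local environment; see the [C/C++ local deployment tool](). After compiling, start the Web API service with the following command.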
|
82 |
|
83 |
-
|
84 |
-
./server -m codeshell.gguf
|
85 |
-
```
|
86 |
|
87 |
-
|
88 |
|
89 |
-
```
|
90 |
-
curl --location 'http://127.0.0.1:8080/completion' --header 'Content-Type: application/json' --data '{"messages": {"content": "用python写个hello world"}, "temperature": 0.2, "stream": true}'
|
91 |
-
```
|
92 |
|
93 |
-
|
94 |
|
95 |
-
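CodeShell provides a [VS Code plugin]() that developers can use for code completion, code Q&A, and similar tasks. The VS Code plugin is also open source; plugin-related questions are welcome in the [VS Code plugin repository]().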
|
96 |
|
97 |
-
##
|
98 |
|
99 |
-
|
100 |
-
- Architecture: GPT-2
|
101 |
-
- Attention: Grouped-Query Attention with Flash Attention 2
|
102 |
-
- Position embedding: Rotary Position Embedding (RoFormer: Enhanced Transformer with Rotary Position Embedding)
|
103 |
-
- Precision: bfloat16
|
104 |
-
- Hyperparameters
|
105 |
-
- n_layer: 42
|
106 |
-
- n_embd: 4096
|
107 |
-
- n_inner: 16384
|
108 |
-
- n_head: 32
|
109 |
-
- num_query_groups: 8
|
110 |
-
- seq-length: 8192
|
111 |
-
- vocab_size: 70144
|
112 |
-
|
113 |
-
CodeShell uses GPT-2 as its base architecture, together with techniques such as Grouped-Query Attention and RoPE relative position encoding.
|
114 |
|
115 |
-
## Evaluation
|
116 |
|
117 |
-
|
118 |
|
119 |
### Pass@1
|
120 |
-
|
|
121 |
| ------- | --------- | --------- | --------- |
|
122 |
-
| humaneval |
|
123 |
-
| mbpp |
|
124 |
-
| multiple-java |
|
125 |
-
| multiple-js |
|
126 |
-
|
127 |
|
128 |
# License
|
129 |
|
130 |
-
The models open-sourced in this repository are released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0)
|
131 |
|
132 |
|
|
|
1 |
---
|
2 |
+
language:
|
3 |
+
- zh
|
4 |
+
- en
|
5 |
+
tags:
|
6 |
+
- codeshell
|
7 |
+
- wisdomshell
|
8 |
+
- pku-kcl
|
9 |
+
- openbankai
|
10 |
---
|
11 |
|
12 |
+
# CodeShell
|
13 |
|
14 |
+
CodeShell是[北京大学知识计算实验室](http://se.pku.edu.cn/kcl/)与蚌壳智能科技联合研发的多语言代码大模型基座。CodeShell具有70亿参数,在五千亿Tokens进行了训练,上下文窗口长度为8192。在权威的代码评估Benchmark(HumanEval与MBPP)上,CodeShell取得同等规模最好的性能。与此同时,我们提供了与CodeShell配套的部署方案与IDE插件,请参考代码库[CodeShell](https://github.com/WisdomShell/codeshell)。
|
15 |
|
16 |
+
CodeShell is a multilingual code LLM foundation model jointly developed by the [Knowledge Computing Lab](http://se.pku.edu.cn/kcl/) of Peking University and Bangke Intelligence Technology. It has 7 billion parameters, was trained on 500 billion tokens, and has a context window of 8192 tokens. On authoritative code evaluation benchmarks (HumanEval and MBPP), CodeShell achieves the best performance among models of its scale. We also provide deployment solutions and IDE plugins for CodeShell; see the [CodeShell code repository](https://github.com/WisdomShell/codeshell) for details.
|
17 |
|
18 |
+
## Main Characteristics of CodeShell
|
|
|
|
|
|
|
|
|
19 |
|
20 |
+
* **强大的性能**:CodeShell在HumanEval和MBPP上达到了7B代码基座大模型的最优性能
|
21 |
+
* **完整的体系**:除了代码大模型,同时开源IDE(VS Code与JetBrains)插件,形成开源的全栈技术体系
|
22 |
+
* **轻量化部署**:支持本地C++部署,提供轻量快速的本地化软件开发助手解决方案
|
23 |
+
* **全面的评测**:提供支持完整项目上下文、覆盖代码生成、代码缺陷检测与修复、测试用例生成等常见软件开发活动的多任务评测体系(即将开源)
|
24 |
+
* **高效的训练**:基于高效的数据治理体系,CodeShell在完全冷启动情况下,只训练了五千亿Token即获得了优异的性能
|
25 |
|
26 |
+
* **Powerful Performance**: CodeShell achieves the best performance among 7B code foundation models on HumanEval and MBPP.
|
27 |
+
* **Complete Ecosystem**: In addition to the code LLM, the IDE plugins (for VS Code and JetBrains) are also open source, forming an open-source full-stack technology system.
|
28 |
+
* **Lightweight Deployment**: Supports local C++ deployment, offering a lightweight and fast localized software development assistant solution.
|
29 |
+
* **Comprehensive Evaluation**: Provides a multi-task evaluation system that supports full project context, covering code generation, code defect detection and repair, test case generation, and other common software development activities (to be open-sourced soon).
|
30 |
+
* **Efficient Training**: Built on an efficient data governance system, CodeShell achieved outstanding performance after training on just 500 billion tokens from a complete cold start.
|
|
|
31 |
|
32 |
+
## Quickstart
|
|
|
33 |
|
34 |
### Code Generation
|
35 |
|
36 |
+
Codeshell 提供了Hugging Face格式的模型,开发者可以通过下列代码加载并使用。
|
37 |
+
|
38 |
+
CodeShell provides a model in the Hugging Face format. Developers can load and use it with the following code.
|
39 |
|
40 |
```python
|
41 |
import torch
|
|
|
51 |
|
52 |
CodeShell 支持Fill-in-the-Middle模式,从而更好的支持软件开发过程。
|
53 |
|
54 |
+
CodeShell supports the Fill-in-the-Middle mode, thereby better facilitating the software development process.
|
55 |
+
|
56 |
+
```python
|
57 |
input_text = "<fim_prefix>def print_hello_world():\n <fim_suffix>\n print('Hello world!')<fim_middle>"
|
58 |
inputs = tokenizer(input_text, return_tensors='pt').to('cuda')  # BatchEncoding has .to(), not .cuda()
|
59 |
outputs = model.generate(**inputs)  # unpack input_ids and attention_mask
|
60 |
print(tokenizer.decode(outputs[0]))
|
61 |
```
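
The FIM prompt above is built from three sentinel tokens that appear in the example: `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>`. As a minimal sketch, the helper below assembles such a prompt from the code before and after the gap. The name `build_fim_prompt` is hypothetical and is not part of CodeShell or `transformers`.

```python
# Sentinel tokens copied from the README example above.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: wrap the code before and after the gap in FIM sentinels."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Reproduces the prompt string used in the example above.
input_text = build_fim_prompt(
    prefix="def print_hello_world():\n ",
    suffix="\n print('Hello world!')",
)
```

The resulting `input_text` can be tokenized and passed to `model.generate` exactly as in the example; the model is expected to produce the missing middle after `<fim_middle>`.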
|
62 |
|
63 |
+
## Model Details
|
64 |
|
|
|
65 |
|
66 |
+
CodeShell使用GPT-2作为基础架构,采用Grouped-Query Attention、RoPE相对位置编码等技术。
|
|
|
|
|
67 |
|
68 |
+
CodeShell uses GPT-2 as its base architecture and adopts techniques such as Grouped-Query Attention and RoPE (rotary position embedding).
|
69 |
|
|
|
|
|
|
|
70 |
|
71 |
+
| Hyper-parameter | Value |
|
72 |
+
|---|---|
|
73 |
+
| n_layer | 42 |
|
74 |
+
| n_embd | 4096 |
|
75 |
+
| n_inner | 16384 |
|
76 |
+
| n_head | 32 |
|
77 |
+
| num_query_groups | 8 |
|
78 |
+
| seq-length | 8192 |
|
79 |
+
| vocab_size | 70144 |
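
As a rough sanity check, the values in the table can be combined to derive the per-head width, the number of query heads that share each key/value head, and an approximate parameter count. This is a sketch under stated assumptions, not the authors' accounting: it assumes a GPT-2-style MLP with two projection matrices, tied input/output embeddings, and it ignores biases and normalization weights.

```python
# Hyper-parameters copied from the table above.
n_layer, n_embd, n_inner = 42, 4096, 16384
n_head, num_query_groups, vocab_size = 32, 8, 70144

head_dim = n_embd // n_head                   # width of each attention head
heads_per_group = n_head // num_query_groups  # query heads sharing one K/V head
kv_dim = num_query_groups * head_dim          # width of the shared K and V projections

# Assumed layout (not confirmed by the README): Q and output projections are
# n_embd x n_embd, K and V are n_embd x kv_dim, the MLP has two n_embd x n_inner
# matrices, and the embeddings are tied. Biases and norms are ignored.
attn_params = 2 * n_embd * n_embd + 2 * n_embd * kv_dim
mlp_params = 2 * n_embd * n_inner
total_params = n_layer * (attn_params + mlp_params) + vocab_size * n_embd

print(f"head_dim={head_dim}, heads_per_group={heads_per_group}")
print(f"estimated parameters: {total_params / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 7.7B, which is in line with the 7 billion parameters stated in the introduction.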
|
80 |
|
|
|
81 |
|
82 |
+
## Evaluation
|
83 |
|
84 |
+
我们选取了目前最流行的两个代码评测数据集(HumanEval与MBPP)对模型进行评估,与目前最先进的两个7B代码大模型CodeLlama与StarCoder相比,CodeShell 取得了最优的成绩。具体评测结果如下。
|
85 |
|
|
|
86 |
|
87 |
+
We evaluated the model on the two most popular code evaluation datasets, HumanEval and MBPP. Compared with CodeLlama and StarCoder, the two most advanced 7B code LLMs, CodeShell achieved the best results on both. The detailed results are as follows.
|
88 |
|
89 |
### Pass@1
|
90 |
+
| Task | codeshell | codellama-7B | starcoderbase-7B |
|
91 |
| ------- | --------- | --------- | --------- |
|
92 |
+
| humaneval | 33.48 | 29.44 | 27.80 |
|
93 |
+
| mbpp | 39.08 | 37.60 | 34.16 |
|
94 |
+
| multiple-java | 29.56 | 29.24 | 24.30 |
|
95 |
+
| multiple-js | 33.60 | 31.30 | 27.02 |
|
96 |
+
| multiple-r | 20.99 | 18.57 | 14.29 |
|
97 |
+
| multiple-rkt | 12.48 | 12.55 | 10.43 |
|
98 |
+
| multiple-cpp | 28.20 | 27.33 | 23.04 |
|
99 |
+
| multiple-cs | 22.34 | 20.38 | 18.99 |
|
100 |
+
| multiple-d | 8.59 | 11.60 | 8.08 |
|
101 |
+
| multiple-go | 71.69 | 75.91 | 73.83 |
|
102 |
+
| multiple-jl | 20.63 | 25.28 | 22.96 |
|
103 |
+
| multiple-lua | 22.92 | 30.50 | 22.92 |
|
104 |
+
| multiple-php | 30.43 | 25.96 | 22.11 |
|
105 |
+
| multiple-pl | 15.65 | 17.45 | 16.40 |
|
106 |
+
| multiple-py | 33.54 | 29.25 | 28.82 |
|
107 |
+
| multiple-rb | 25.71 | 30.06 | 18.51 |
|
108 |
+
| multiple-rs | 26.86 | 25.90 | 22.82 |
|
109 |
+
| multiple-swift | 25.00 | 25.32 | 15.70 |
|
110 |
+
| multiple-ts | 33.90 | 32.64 | 27.48 |
|
111 |
+
| multiple-sh | 8.42 | 9.75 | 7.09 |
|
112 |
+
| multiple-scala | 22.56 | 24.50 | 19.12 |
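
The README does not say how the Pass@1 scores were computed. A common choice, assumed here, is the unbiased pass@k estimator: with n samples generated per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n when k = 1. The sketch below implements that estimator; the sample counts are made up for illustration and are not CodeShell's evaluation data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts only, not real evaluation data.
example = [(10, 3), (10, 0), (10, 10)]
score = mean_pass_at_k(example, k=1)
```

Multiplying the averaged score by 100 gives a percentage on the same scale as the table.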
|
113 |
|
114 |
# License
|
115 |
|
116 |
+
本仓库开源的模型遵循[Apache 2.0 许可证](https://www.apache.org/licenses/LICENSE-2.0),对学术研究完全开放,若需要商用,开发者可发送邮件进行申请,得到书面授权后即可使用。联系邮箱:[[email protected]](mailto:[email protected])
|
117 |
+
|
118 |
+
|
119 |
+
The model open-sourced in this repository is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0) and is fully open for academic research. For commercial use, developers must apply by email and may use the model once written authorization is granted. Contact email: [[email protected]](mailto:[email protected]).
|
120 |
|
121 |
|