PunchPunch22 committed on
Commit 23ddca4 • 1 Parent(s): ea5dbd8
Upload 7 files
- README.md +79 -10
- config.json +54 -0
- gitattributes.txt +34 -0
- merges.txt +0 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -1,13 +1,82 @@
 ---
-
-emoji: 🐠
-colorFrom: indigo
-colorTo: green
-sdk: gradio
-sdk_version: 3.34.0
-app_file: app.py
-pinned: false
-license: mit
+license: afl-3.0
 ---
 
-
+# PyCodeGPT
+
+A pre-trained GPT model for Python code completion and generation.
+
+## What is it?
+
+PyCodeGPT is an efficient and effective GPT-Neo-based model for the Python code generation task, similar to [OpenAI Codex](https://openai.com/blog/openai-codex/), [GitHub Copilot](https://copilot.github.com/), [CodeParrot](https://huggingface.co/blog/codeparrot), and [AlphaCode](https://deepmind.com/blog/article/Competitive-programming-with-AlphaCode).
+
+## Training Data
+
+Because publicly released datasets are small, we collected data from GitHub from scratch. We first crawled 1.2M Python-related repositories hosted on GitHub, then used these repository URLs to download all contents of each repository. This yielded 60M raw Python files under 1MB, with a total size of 330GB. Finally, we carefully designed various data-cleaning strategies to obtain about 96GB of data for training. Please refer to the following table for details.
+
+|Model|Repositories|Size and number of files after filtering|
+|:------:|:---:|:---:|
+| CodeParrot | 0.56M | 12GB (compressed), 5.4M |
+| Codex | 54M | 159GB |
+| PyCodeGPT | 1.2M | 96GB, 13M |
+
+## Pretrained models
+
+We aim to train medium-sized pre-trained models (about 110M parameters) based on GPT-Neo:
+- PyCodeGPT-110M: derived from GPT-Neo 125M with a vocabulary size of 32K.
+
+## GitHub
+
+[https://github.com/microsoft/PyCodeGPT](https://github.com/microsoft/PyCodeGPT)
+
+## Evaluation Results
+
+Here are our evaluation results on the HumanEval dataset:
+
+Note: our model achieves accuracy comparable to Codex models of similar size.
+
+|Model|Pass@1|Pass@10|Pass@100|
+|:------:|:---:|:---:|:---:|
+|PyCodeGPT-110M |**8.32%** |**13.53%** |**18.3%** |
+|||||
+|GPT-Neo 125M |0.75% |1.88% |2.97% |
+|GPT-Neo 1.3B |4.97% |7.47% |16.3% |
+|GPT-Neo 2.7B |6.41% |11.27% |21.37% |
+|GPT-J 6B |11.62% |15.74% |27.74% |
+|||||
+|TabNine |2.58% |4.35% |7.59% |
+|||||
+|CodeParrot 110M |3.80% |6.57% |12.78% |
+|CodeParrot 1.5B |3.58% |8.03% |14.96% |
+|||||
+|Codex 12M |2.00% |3.62% |8.58% |
+|Codex 25M |3.21% |7.1% |12.89% |
+|Codex 42M |5.06% |8.8% |15.55% |
+|Codex 85M |8.22% |12.81% |22.4% |
+|Codex 300M |13.17% |20.37% |36.27% |
+|Codex 679M |16.22% |25.7% |40.95% |
+|Codex 2.5B |21.36% |35.42% |59.5% |
+|Codex 12B |28.81% |46.81% |72.31% |
+|||||
+|Pretrained Decoder-only 13M (AlphaCode) |1.5% |3.6% |8.6% |
+|Pretrained Decoder-only 29M (AlphaCode) |3.4% |5.8% |11.2% |
+|Pretrained Decoder-only 55M (AlphaCode) |4.2% |8.2% |16.9% |
+|Pretrained Decoder-only 89M (AlphaCode) |4.3% |12.2% |20.0% |
+|Pretrained Decoder-only 302M (AlphaCode) |11.6% |18.8% |31.8% |
+|Pretrained Decoder-only 685M (AlphaCode) |14.2% |24.4% |38.8% |
+|Pretrained Decoder-only 1.1B (AlphaCode) |17.1% |28.2% |45.3% |
+|||||
+|PolyCoder 160M |2.13% |3.35% |4.88% |
+|PolyCoder 400M |2.96% |5.29% |11.59% |
+|PolyCoder 2.7B |5.59% |9.84% |17.68% |
+
+## Reference
+
+If you use the models, please cite the following paper:
+
+```
+@inproceedings{CERT,
+  title={{CERT}: Continual Pre-training on Sketches for Library-oriented Code Generation},
+  author={Zan, Daoguang and Chen, Bei and Yang, Dejian and Lin, Zeqi and Kim, Minsu and Guan, Bei and Wang, Yongji and Chen, Weizhu and Lou, Jian-Guang},
+  booktitle={The 2022 International Joint Conference on Artificial Intelligence},
+  year={2022}
+}
+```
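The Pass@k numbers in the evaluation table above are typically computed with the unbiased estimator popularized by the Codex/HumanEval evaluation: given n sampled completions per task of which c pass the unit tests, pass@k is the probability that a random size-k subset contains at least one passing sample. A minimal sketch (the 200/17 figures in the usage line are illustrative, not from this model card):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them pass the tests."""
    if n - c < k:
        # Too few failing samples to fill a size-k subset: every subset passes.
        return 1.0
    # 1 - P(all k chosen samples fail) = 1 - C(n-c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 completions for one task, 17 of which pass, reported at k=10
print(f"pass@10 = {pass_at_k(200, 17, 10):.2%}")
```

The per-task estimates are then averaged over all 164 HumanEval problems to produce table entries like those above.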
config.json
ADDED
@@ -0,0 +1,54 @@
{
  "_name_or_path": "//amlt0a3b8c9fa72c7a7e36e6cd517fb7abe6/data/pycode_func_0214_17M_codepy-110M/model",
  "activation_function": "gelu_new",
  "architectures": [
    "GPTNeoForCausalLM"
  ],
  "attention_dropout": 0,
  "attention_layers": [
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local"
  ],
  "attention_types": [
    [
      [
        "global",
        "local"
      ],
      6
    ]
  ],
  "bos_token_id": 1,
  "embed_dropout": 0,
  "eos_token_id": 0,
  "gradient_checkpointing": false,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": null,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neo",
  "num_heads": 12,
  "num_layers": 12,
  "resid_dropout": 0,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.12.5",
  "use_cache": true,
  "vocab_size": 32000,
  "window_size": 256
}
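The config above lets one sanity-check the advertised ~110M parameter count. A rough back-of-the-envelope sketch; the bias and weight-tying conventions are assumptions based on the GPT-Neo architecture (q/k/v projections without bias, tied input/output embeddings, `intermediate_size: null` defaulting to 4× the hidden size):

```python
# Values copied from the config.json above.
cfg = {
    "hidden_size": 768,
    "num_layers": 12,
    "vocab_size": 32000,
    "max_position_embeddings": 2048,
    "intermediate_size": None,  # GPT-Neo defaults this to 4 * hidden_size
}

h = cfg["hidden_size"]
inner = cfg["intermediate_size"] or 4 * h

# Token + learned position embeddings
embeddings = cfg["vocab_size"] * h + cfg["max_position_embeddings"] * h

attn = 4 * h * h + h           # q/k/v (no bias in GPT-Neo) + out_proj with bias
mlp = 2 * h * inner + inner + h  # c_fc and c_proj, each with bias
layer_norms = 4 * h            # two LayerNorms (weight + bias) per block
per_layer = attn + mlp + layer_norms

# Final LayerNorm adds 2*h; lm_head is assumed tied to the token embeddings.
total = embeddings + cfg["num_layers"] * per_layer + 2 * h
print(f"~{total / 1e6:.0f}M parameters")
```

This lands at roughly 111M, consistent with the PyCodeGPT-110M name and its GPT-Neo 125M ancestry (the smaller 32K vocabulary shrinks the embedding matrix).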
gitattributes.txt
ADDED
@@ -0,0 +1,34 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
merges.txt
ADDED
The diff for this file is too large to render.
See raw diff
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "<|beginoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|unkoftext|>", "pad_token": "<|padoftext|>"}
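Note that this map extends stock GPT-Neo, which ships only `<|endoftext|>`; the `<|beginoftext|>`, `<|unkoftext|>`, and `<|padoftext|>` sentinels are specific to this 32K tokenizer. A quick sketch validating the map's shape (the literal string mirrors the file content above):

```python
import json

raw = ('{"bos_token": "<|beginoftext|>", "eos_token": "<|endoftext|>", '
       '"unk_token": "<|unkoftext|>", "pad_token": "<|padoftext|>"}')
special_tokens = json.loads(raw)

# Every entry follows the <|...|> sentinel convention, so these tokens
# cannot collide with ordinary BPE tokens produced from source code.
for name, token in special_tokens.items():
    assert token.startswith("<|") and token.endswith("|>"), (name, token)
print(sorted(special_tokens))
```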
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
vocab.json
ADDED
The diff for this file is too large to render.
See raw diff