---
title: README
emoji: πŸ‘€
colorFrom: yellow
colorTo: purple
sdk: static
pinned: false
---
<p>
    <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/codeparrot2.png" alt="drawing" width="440"/>
</p>
<p>Check out our new instruction-tuning resources:</p>
<ul>
<li>
<p>
    <b>InstructHumanEval: </b>a variant of the HumanEval benchmark adapted for instruction-tuned models (a loading sketch follows this list): <a
    href="https://huggingface.co/datasets/codeparrot/instructhumaneval"
    class="underline">InstructHumanEval</a>
</p></li>
<li>
<p>
    <b>Full Curated CoNaLa: </b>we used UL2 to rewrite more than 590k uncurated intents in the CoNaLa dataset: <a
    href="https://huggingface.co/datasets/codeparrot/conala-mined-curated"
    class="underline">conala-mined-curated</a>
</p></li>
<li>
<p>
    <b>Self-Instruct with StarCoder: </b>we release a self-instruct dataset generated with StarCoder, as well as the code we used to build it: <a
    href="https://huggingface.co/datasets/codeparrot/self-instruct-starcoder"
    class="underline">self-instruct-starcoder</a>
</p>
</li>
<li>
<p>
    <b>Models trained on CoNaLa and self-instruct StarCoder: </b>we release the models we trained on the two previous datasets.
</p>
</li>
</ul>
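<p>As a quick start, here is a minimal sketch of loading InstructHumanEval with the πŸ€— <code>datasets</code> library. The <code>"test"</code> split name is an assumption; check the dataset card for the exact schema.</p>
<pre><code class="language-python">from datasets import load_dataset

# Load the instruction-tuned variant of HumanEval.
# The "test" split mirrors the original HumanEval layout (assumption).
ds = load_dataset("codeparrot/instructhumaneval", split="test")

# Inspect the available columns before building prompts.
print(ds.column_names)
print(ds[0])
</code></pre>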


<hr>
<p>
    This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code (see the generation sketch below). For more advanced code language models and
  pre-training datasets, we recommend checking out our work in the <a href="https://huggingface.co/bigcode">BigCode organization</a>.
</p>
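<p>Since CodeParrot uses the standard GPT-2 architecture, the usual πŸ€— <code>transformers</code> text-generation pipeline applies out of the box. A minimal sketch with the small checkpoint; the sampling settings are illustrative, not the settings used in our evaluations.</p>
<pre><code class="language-python">from transformers import pipeline

# CodeParrot is a GPT-2 model, so the standard
# text-generation pipeline works without extra configuration.
generator = pipeline("text-generation", model="codeparrot/codeparrot-small")

prompt = "def fibonacci(n):"
out = generator(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
print(out[0]["generated_text"])
</code></pre>
<p>Here you can find:</p>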

<ul>
<li>
<p>
    <b>Interactive blog:</b> where we compare different code models and explain how they are trained and evaluated: <a
    href="https://huggingface.co/spaces/loubnabnl/code-generation-models"
    class="underline">Code generation with πŸ€—</a>
</p>
</li>

<li>
<p>
<b>Spaces:</b>
</p>
<ul>
<li>Code generation with: <a href="https://huggingface.co/codeparrot/codeparrot" class="underline">CodeParrot (1.5B)</a>, <a href="https://huggingface.co/facebook/incoder-6B" class="underline">InCoder (6B)</a> and <a href="https://github.com/salesforce/CodeGen" class="underline">CodeGen (6B)</a></li>
<li>Spaces for some code downstream tasks: algorithmic complexity prediction (BigO), code explanation, and code generation from English text.</li>
</ul>
</li>

<li><b>Models:</b> CodeParrot (1.5B) and CodeParrot-small (110M); each repo hosts different ongoing experiments in its branches.</li>

<li><b>Metrics:</b> <a href="https://huggingface.co/spaces/codeparrot/apps_metric" class="underline">APPS metric</a> for the evaluation of code models on the <a href="https://huggingface.co/datasets/codeparrot/apps" class="underline">APPS</a> benchmark.</li>

<li><b>Datasets:</b><ol>
<li><a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean" class="underline">codeparrot-clean</a>, the dataset on which we trained and evaluated CodeParrot; the splits are available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean-train" class="underline">codeparrot-clean-train</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid" class="underline">codeparrot-clean-valid</a>.</li>
<li>A more filtered version of codeparrot-clean, available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering" class="underline">codeparrot-train-more-filtering</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering" class="underline">codeparrot-valid-more-filtering</a>.</li>
<li>The CodeParrot dataset after near-deduplication (initially only exact-match deduplication was performed), available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-near-deduplication" class="underline">codeparrot-train-near-deduplication</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-near-deduplication" class="underline">codeparrot-valid-near-deduplication</a>.</li>
<li>The CodeParrot dataset after both near-deduplication and the additional filtering, available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-v2-near-dedup" class="underline">codeparrot-train-v2-near-dedup</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-v2-near-dedup" class="underline">codeparrot-valid-v2-near-dedup</a>.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-code" class="underline">GitHub-Code</a>, a 1TB dataset of GitHub files in 32 programming languages (see the streaming sketch after this list).</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-code-clean" class="underline">GitHub-Code-Clean</a>, a cleaner version of the GitHub-Code dataset.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-jupyter" class="underline">GitHub-Jupyter</a>, a 16.3GB dataset of Jupyter Notebooks from the GitHub dataset on BigQuery.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/github-jupyter-text-code-pairs" class="underline">github-jupyter-text-code-pairs</a>, a dataset of text and code pairs extracted from Jupyter notebooks; it is a parsed version of the github-jupyter dataset.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/apps" class="underline">APPS</a>, a benchmark for code generation with 10,000 problems.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/codecomplex" class="underline">CodeComplex</a>, an annotated dataset of 4,200 Java programs and their time complexity.</li>
<li><a href="https://huggingface.co/datasets/codeparrot/xlcost-text-to-code" class="underline">XLCOST-text-to-code</a>, a subset of the XLCoST benchmark for text-to-code generation at snippet and program level in 7 programming languages: Python, C, C#, C++, Java, JavaScript and PHP.</li>
</ol>
</li>
</ul>
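<p>Several of these datasets are large (GitHub-Code alone is ~1TB), so streaming is usually preferable to a full download. A minimal sketch follows; the <code>languages</code> filter and field names follow the GitHub-Code dataset card, and depending on your <code>datasets</code> version you may also need <code>trust_remote_code=True</code>.</p>
<pre><code class="language-python">from datasets import load_dataset

# Stream GitHub-Code instead of downloading ~1TB to disk;
# restrict to Python files via the dataset's "languages" option.
ds = load_dataset("codeparrot/github-code", split="train",
                  streaming=True, languages=["Python"])

# Peek at the first sample without materializing the dataset.
for sample in ds.take(1):
    print(sample["repo_name"], sample["path"])
    print(sample["code"][:200])
</code></pre>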