loubnabnl HF staff commited on
Commit
e6844a8
·
1 Parent(s): 5b1095d

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +5 -4
README.md CHANGED
@@ -39,10 +39,11 @@ pinned: false
39
 
40
  <li>2- A more filtered version of codeparrot-clean under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering" class="underline">codeparrot-train-more-filtering</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering" class="underline">codeparrot-train-more-filtering</a>.</li>
41
  <li>3- CodeParrot dataset after near deduplication since initially only exact match deduplication was performed, it's available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-near-deduplication" class="underline">codeparrot-train-near-deduplication</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-near-deduplication" class="underline">codeparrot-train-near-deduplication</a>.</li>
42
-
43
- <li>4- <a href="https://huggingface.co/datasets/codeparrot/github-code" class="underline">GitHub-Code</a>, a 1TB dataset of 32 programming languages from GitHub files.</li>
44
- <li>5- <a href="https://huggingface.co/datasets/codeparrot/github-jupyter" class="underline">GitHub-Jupyter</a>, a 16.3GB dataset of Jupyter Notebooks from BigQuery GitHub.</li>
 
45
  <li>6- <a href="https://huggingface.co/datasets/codeparrot/apps" class="underline">APPS</a>, a benchmark for code generation with 10000 problems.</li>
46
- </ul>
47
  </li>
48
  </ul>
 
39
 
40
  <li>2- A more filtered version of codeparrot-clean under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering" class="underline">codeparrot-train-more-filtering</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering" class="underline">codeparrot-train-more-filtering</a>.</li>
41
  <li>3- CodeParrot dataset after near deduplication since initially only exact match deduplication was performed, it's available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-near-deduplication" class="underline">codeparrot-train-near-deduplication</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-near-deduplication" class="underline">codeparrot-train-near-deduplication</a>.</li>
42
+ </li>
43
+ <li>4- CodeParrot dataset after both near deduplication and the additional filtering , it's available under <a href="https://huggingface.co/datasets/codeparrot/codeparrot-train-v2-near-dedup" class="underline">codeparrot-train-v2-near-dedup</a> and <a href="https://huggingface.co/datasets/codeparrot/codeparrot-valid-v2-near-dedup" class="underline">codeparrot-valid-v2-near-dedup</a>.</li>
44
+ <li>5- <a href="https://huggingface.co/datasets/codeparrot/github-code" class="underline">GitHub-Code</a>, a 1TB dataset of 32 programming languages from GitHub files.</li>
45
+ <li>5- <a href="https://huggingface.co/datasets/codeparrot/github-jupyter" class="underline">GitHub-Jupyter</a>, a 16.3GB dataset of Jupyter Notebooks from BigQuery GitH6b.</li>
46
  <li>6- <a href="https://huggingface.co/datasets/codeparrot/apps" class="underline">APPS</a>, a benchmark for code generation with 10000 problems.</li>
47
+ </ul7
48
  </li>
49
  </ul>