---
title: README
emoji: π
colorFrom: yellow
colorTo: purple
sdk: static
pinned: false
---
This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code.
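As a minimal sketch, the CodeParrot checkpoints can be used for code generation through the standard transformers text-generation pipeline; the prompt and sampling settings below are only illustrative:

```python
# Minimal sketch: generating Python code with the codeparrot/codeparrot-small
# checkpoint (110M) via the transformers text-generation pipeline.
# Swap in "codeparrot/codeparrot" for the 1.5B model.
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small")

prompt = "def fibonacci(n):"
# Illustrative sampling settings; tune max_new_tokens/temperature as needed.
outputs = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
print(outputs[0]["generated_text"])
```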
Table of contents:

- Interactive blog where we compare different code models and explain how they are trained and evaluated: Code generation with 🤗
- Spaces: code generation with CodeParrot (1.5B), InCoder (6B) and CodeGen (6B)
- Models: CodeParrot (1.5B) and CodeParrot-small (110M); each repo has different ongoing experiments in its branches.
- Datasets (see the loading sketch after this list):
  - codeparrot-clean, the dataset on which we trained and evaluated CodeParrot; the splits are available under codeparrot-clean-train and codeparrot-clean-valid.
  - A more filtered version of codeparrot-clean, available under codeparrot-train-more-filtering and codeparrot-valid-more-filtering.
  - The CodeParrot dataset after near deduplication (initially only exact-match deduplication was performed), available under codeparrot-train-near-deduplication and codeparrot-valid-near-deduplication.
  - GitHub-Code, a 1TB dataset of 32 programming languages with 60 file extensions, built from GitHub files.
  - GitHub-Jupyter, a 16.3GB dataset of Jupyter notebooks from GitHub, extracted via BigQuery.
  - APPS, a benchmark for code generation with 10,000 problems.
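A minimal sketch of loading one of these datasets with the datasets library, assuming the codeparrot/codeparrot-clean-train repo name from the list above; streaming avoids downloading the full dataset up front:

```python
# Minimal sketch: stream the training split of codeparrot-clean so the large
# dataset does not have to be downloaded in full before inspecting it.
from datasets import load_dataset

ds = load_dataset("codeparrot/codeparrot-clean-train", split="train", streaming=True)

# Look at the first example; the raw source code lives in a text field
# such as "content" alongside repository metadata.
first = next(iter(ds))
print(first.keys())
```

The same pattern applies to the other dataset repos listed above by substituting the repo name in load_dataset.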