---
title: README
emoji: π
colorFrom: yellow
colorTo: purple
sdk: static
pinned: false
---
This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code.
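As a minimal sketch, the CodeParrot checkpoints can be used for code generation through the standard transformers text-generation pipeline; the prompt and sampling settings below are only illustrative:

```python
# Minimal sketch: generating Python code with the codeparrot/codeparrot-small
# checkpoint (110M) via the transformers text-generation pipeline.
# Swap in "codeparrot/codeparrot" for the 1.5B model.
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small")

prompt = "def fibonacci(n):"
# Illustrative sampling settings; tune max_new_tokens/temperature as needed.
outputs = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
print(outputs[0]["generated_text"])
```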
Table of contents:

- Interactive blog where we compare different code models and explain how they are trained and evaluated: Code generation with 🤗
- Spaces: code generation with CodeParrot (1.5B), InCoder (6B) and CodeGen (6B)
- Models: CodeParrot (1.5B) and CodeParrot-small (110M); each repo has different ongoing experiments in its branches.
- Datasets (see the loading sketch after this list):
  - codeparrot-clean, the dataset on which we trained and evaluated CodeParrot; the splits are available under codeparrot-clean-train and codeparrot-clean-valid.
  - A more filtered version of codeparrot-clean, available under codeparrot-train-more-filtering and codeparrot-valid-more-filtering.
  - The CodeParrot dataset after near deduplication (initially only exact-match deduplication was performed), available under codeparrot-train-near-deduplication and codeparrot-valid-near-deduplication.
  - GitHub-Code, a 1TB dataset of 32 programming languages with 60 file extensions, built from GitHub files.
  - GitHub-Jupyter, a 16.3GB dataset of Jupyter notebooks from GitHub, extracted via BigQuery.
  - APPS, a benchmark for code generation with 10,000 problems.
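A minimal sketch of loading one of these datasets with the datasets library, assuming the codeparrot/codeparrot-clean-train repo name from the list above; streaming avoids downloading the full dataset up front:

```python
# Minimal sketch: stream the training split of codeparrot-clean so the large
# dataset does not have to be downloaded in full before inspecting it.
from datasets import load_dataset

ds = load_dataset("codeparrot/codeparrot-clean-train", split="train", streaming=True)

# Look at the first example; the raw source code lives in a text field
# such as "content" alongside repository metadata.
first = next(iter(ds))
print(first.keys())
```

The same pattern applies to the other dataset repos listed above by substituting the repo name in load_dataset.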