File size: 7,460 Bytes
d53aecc 6c63bc7 b10d81e f8f4474 d53aecc 2d5e5eb d53aecc 3e05653 01badbb d53aecc 877e1a6 d53aecc f8f4474 cc9efae f8f4474 cc9efae f8f4474 d53aecc f8f4474 d53aecc 09b16ec d53aecc 010ca0c d53aecc f8f4474 d53aecc f8f4474 d53aecc 6c63bc7 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 |
---
license: other
license_name: inf
license_link: https://huggingface.co/infly/OpenCoder-8B-Instruct/blob/main/LICENSE
language:
- en
- zh
base_model:
- infly/OpenCoder-8B-Base
pipeline_tag: text-generation
library_name: transformers
datasets:
- OpenCoder-LLM/opencoder-sft-stage1
- OpenCoder-LLM/opencoder-sft-stage2
---
<div align="center">
<img src="https://github.com/OpenCoder-llm/opencoder-llm.github.io/blob/main/static/images/opencoder_icon.jpg?raw=true" width="50%" alt="OpenCoder-Icon" />
</div>
<p align="center">
<!-- <a href="https://arxiv.org/pdf/2411.04905"><b>Paper Link</b>ποΈ</a> -->
π <a href="https://opencoder-llm.github.io/">Home Page</a>   |
   π€ <a href="https://huggingface.co/collections/infly/opencoder-672cec44bbb86c39910fb55e">Model</a>   |
   π <a href="https://huggingface.co/collections/OpenCoder-LLM/opencoder-datasets-672e6db6a0fed24bd69ef1c2">Dataset</a>   |
   π<a href="https://arxiv.org/abs/2411.04905">Paper</a>   |
   π<a href="https://huggingface.co/spaces/OpenCoder-LLM/OpenCoder-8B-Instruct">Demo</a>  
</p>
## 1. Introduction
**OpenCoder** is an open and reproducible code LLM family which includes 1.5B and 8B base and chat models, supporting both English and Chinese languages. Starting from scratch, OpenCoder is pretrained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, and supervised finetuned on over 4.5M high-quality SFT examples, finally reaching the performance of top-tier code LLMs. We provide not only model weights and inference code, but also the reproducible training data, the complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols. Empowering researchers to build and innovate, OpenCoder is your open foundation for advancing code AI.
- **Complete Open Source**: OpenCoder ensures full transparency by releasing not only the model weights and forthcoming inference code but also the complete data-cleaning code for training. This release includes high-quality synthetic data, an extensive set of checkpoints, and a dataset of over 4.5 million supervised fine-tuning (SFT) entries, making OpenCoder one of the most comprehensively open-sourced models available.
- **Comprehensive Experimental Analysis**: OpenCoder is rigorously tested through extensive ablation studies on various data-cleaning strategies and training processes, including file-level and repository-level deduplication experiments, ensuring thorough exploration and validation of the modelβs performance.
- **High-Quality Synthetic Data**: OpenCoder provides a fully developed synthetic data generation process and over 4.5 million SFT data entries, establishing a robust data foundation for model training and evaluation.
- **Exceptional Performance**: OpenCoder achieves high performance across multiple language model benchmarks, positioning it among the leading open-source models for code.
## 2. Models
| Model | Sequence Length | Download |
|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|
| OpenCoder-1.5B-Base | 4K | π€ [HuggingFace](https://huggingface.co/infly/OpenCoder-1.5B-Base) |
| OpenCoder-8B-Base | 8K | π€ [HuggingFace](https://huggingface.co/infly/OpenCoder-8B-Base) |
| OpenCoder-1.5B-Instruct | 4K | π€ [HuggingFace](https://huggingface.co/infly/OpenCoder-1.5B-Instruct) |
| OpenCoder-8B-Instruct | 8K | π€ [HuggingFace](https://huggingface.co/infly/OpenCoder-8B-Instruct) |
## 3. Datasets
### Pre-training
| Dataset | Size | Download |
|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|
| fineweb-code-corpus | 148 GB | π€ [HuggingFace](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus) |
| fineweb-math-corpus | 10 GB | π€ [HuggingFace](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus) |
### Post-training
| Dataset | Num | Download |
|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|
| opencoder-sft-stage1 | 4.21 M | π€ [HuggingFace](https://huggingface.co/datasets/OpenCoder-LLM/opencoder-sft-stage1) |
| opencoder-sft-stage2 | 375 K | π€ [HuggingFace](https://huggingface.co/datasets/OpenCoder-LLM/opencoder-sft-stage2) |
**This is not the end; we are organizing the remaining data and uploading it progressively.**
## 4. Benchmarks
**Note:** For the detailed evaluation results, please refer to [our paper](https://arxiv.org/pdf/2411.04905).
<!-- ### Base Model -->
<!-- | model | OpenCoder-1.5B-Base | OpenCoder-8B-Base |
|:---------------:|:-------------:|:------------:|
| HumanEval(+) | 54.3 (49.4) | 66.5 (63.4) |
| MBPP(+) | 70.6 (58.7) | 79.9 (70.4) |
| BigCodeBench | 24.5 | 40.5 |
| BigCodeBench-Hard | 5.4 | 9.5 | -->
<!-- ### Chat Model -->
| model | OpenCoder-1.5B-Instruct | OpenCoder-8B-Instruct |
|:---------------:|:-------------:|:------------:|
| HumanEval(+) | 72.5 (67.7) | 83.5 (78.7) |
| MBPP(+) | 72.7 (61.9) | 79.1 (69.0) |
| BigCodeBench | 33.3 | 40.3 |
| BigCodeBench-Hard | 11.5 | 16.9 |
| LiveCodeBench | 12.8 | 23.2 |
| MultiPL-E (AVG) | 57.5 | 71.0 |
## 5. Inference
### Inference with Huggingface's Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "infly/OpenCoder-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
messages=[
{ 'role': 'user', 'content': "write a quick sort algorithm in python."}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
result = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print(result)
```
<!-- ### Inference with vLLM (recommended) -->
## 6. License
OpenCoder series (including Base and Chat) support commercial applications under a permissive [License](https://huggingface.co/infly/OpenCoder-8B-Instruct/blob/main/LICENSE).
## 7. Citation
```
@inproceedings{Huang2024OpenCoderTO,
title={OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
author={Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
year={2024},
url={https://arxiv.org/pdf/2411.04905}
}
``` |