|
--- |
|
frameworks: |
|
- Pytorch |
|
license: other |
|
tasks: |
|
- text-generation |
|
--- |
|
|
|
# Model Card for CodeFuse-CodeLlama-34B-4bits |
|
<p align="left"> |
|
<img src="./LOGO.png" width="100%" /> |
|
</p> |
|
|
|
[[Chinese]](#chinese) [[English]](#english)
|
|
|
|
|
|
|
<a id="english"></a> |
|
|
|
## Model Description |
|
|
|
CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a 34B Code-LLM fine-tuned with QLoRA on multiple code tasks (600k instructions/answers) over the base model CodeLlama-34b-Python.
|
|
|
After undergoing 4-bit quantization, the CodeFuse-CodeLlama-34B-4bits model can be loaded on either a single A10 (24GB VRAM) or an RTX 4090 (24GB VRAM). Moreover, the quantized model still achieves an impressive accuracy of 73.8% on the HumanEval pass@1 metric.
|
|
|
|
|
<br> |
|
|
|
## News and Updates |
|
|
|
|
|
🔥🔥🔥 2023-09-26 We are pleased to announce the release of the 4-bit quantized version of CodeFuse-CodeLlama-34B. Despite the quantization process, the model still achieves a remarkable 73.8% accuracy (greedy decoding) on the HumanEval pass@1 metric. |
|
|
|
🔥🔥🔥 2023-09-11 CodeFuse-CodeLlama-34B achieved 74.4% pass@1 (greedy decoding) on HumanEval, which is the state-of-the-art result among open-sourced LLMs at present.
|
|
|
<br> |
|
|
|
## Code Community |
|
|
|
**Homepage**: 🏡 https://github.com/codefuse-ai (**Please give us your support with a Star🌟 + Fork🚀 + Watch👀**) |
|
|
|
+ If you wish to fine-tune the model yourself, you can visit ✨[MFTCoder](https://github.com/codefuse-ai/MFTCoder)✨✨ |
|
|
|
+ If you wish to deploy the model yourself, you can visit ✨[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse)✨✨ |
|
|
|
+ If you wish to see a demo of the model, you can visit ✨[CodeFuse Demo](https://github.com/codefuse-ai/codefuse)✨✨ |
|
|
|
<br> |
|
|
|
## Performance |
|
|
|
|
|
| Model | HumanEval(pass@1) | Date | |
|
|:--------------------------------|:-----------------:|:-------:| |
|
| **CodeFuse-CodeLlama-34B** | **74.4%** | 2023.9 | |
|
|**CodeFuse-CodeLlama-34B-4bits** | **73.8%** | 2023.9 | |
|
| WizardCoder-Python-34B-V1.0 | 73.2% | 2023.8 | |
|
| GPT-4(zero-shot) | 67.0% | 2023.3 | |
|
| PanGu-Coder2 15B | 61.6% | 2023.8 | |
|
| CodeLlama-34b-Python | 53.7% | 2023.8 | |
|
| CodeLlama-34b | 48.8% | 2023.8 | |
|
| GPT-3.5(zero-shot) | 48.1% | 2022.11 | |
|
| OctoCoder | 46.2% | 2023.8 | |
|
| StarCoder-15B | 33.6% | 2023.5 | |
|
| LLaMA 2 70B(zero-shot) | 29.9% | 2023.7 | |
|
|
|
<br> |
|
|
|
## GPU Memory Usage |
|
We measured the GPU memory usage after loading the model, as well as the memory usage when encoding 2048/1024 tokens and generating 1024/2048 tokens. The results are presented in the table below. |
|
|
|
| Precision | Idle Model | Encoding 2048 tokens and Generating 1024 tokens | Encoding 1024 tokens and Generating 2048 tokens | |
|
|:--------------------------------|:-------------------|:------------------------:|:------------:| |
|
|bfloat16 | 64.89GB | 69.31GB | 66.41GB | |
|
|int4 | 19.09GB | 22.19GB | 20.78GB | |
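
As a rough illustration of how such measurements can be reproduced, the sketch below wraps loading and generation with PyTorch's peak-memory counters. It is only a sketch: it assumes the `load_model_tokenizer` function from the Quickstart section below and a hypothetical long `prompt` string, and `torch.cuda.max_memory_allocated` reports the peak seen by PyTorch's allocator, which can differ slightly from `nvidia-smi` readings.

```python
import torch

# Minimal sketch (not the script used to produce the table above).
# Assumes load_model_tokenizer() from the Quickstart below and a long `prompt` string.

def report_peak_memory(tag):
    # Peak memory seen by PyTorch's caching allocator, in GB
    print(f'{tag}: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GB')

torch.cuda.reset_peak_memory_stats()
model, tokenizer = load_model_tokenizer('codefuse-ai/CodeFuse-CodeLlama-34B-4bits')
report_peak_memory('idle model')

input_ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    model.generate(input_ids=input_ids, max_new_tokens=1024, do_sample=False)
report_peak_memory('after encoding and generation')
```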
|
|
|
<br> |
|
|
|
## Requirements |
|
|
|
* python>=3.8 |
|
* pytorch>=2.0.0 |
|
* transformers==4.32.0 |
|
* auto_gptq==0.4.2 |
|
* Sentencepiece |
|
* CUDA 11.4 |
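
The pinned dependencies above can be installed in one step; this is only an illustrative command, and the `requirements.txt` shipped with the repository (see Quickstart) remains authoritative:

```bash
pip install "torch>=2.0.0" "transformers==4.32.0" "auto_gptq==0.4.2" sentencepiece
```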
|
|
|
<br> |
|
|
|
## Inference String Format |
|
|
|
The inference string is a concatenated string formed by combining conversation data (human and bot contents) in the training data format. It is used as input during the inference process. |
|
Here is an example format of the concatenated string: |
|
|
|
```python |
|
""" |
|
<|role_start|>human<|role_end|>Human 1st round input |
|
<|role_start|>bot<|role_end|>Bot 1st round output</s> |
|
<|role_start|>human<|role_end|>Human 2nd round input |
|
<|role_start|>bot<|role_end|>Bot 2nd round output</s> |
|
... |
|
... |
|
... |
|
<|role_start|>human<|role_end|>Human nth round input |
|
<|role_start|>bot<|role_end|>{Bot output to be generated}</s>
|
""" |
|
``` |
|
|
|
When running inference, always make your input string end with "<|role_start|>bot<|role_end|>" to prompt the model to generate an answer.
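
For multi-turn conversations, this concatenation can be wrapped in a small helper. The sketch below is illustrative only: `build_prompt` and the `history` structure are not part of this repository, but the produced string follows the format above (human inputs conventionally end with a newline, as in the Quickstart code).

```python
def build_prompt(history, current_input):
    """Concatenate (human, bot) turns in the training format, ending with an open bot turn."""
    prompt = ""
    for human_text, bot_text in history:
        prompt += f"<|role_start|>human<|role_end|>{human_text}"
        prompt += f"<|role_start|>bot<|role_end|>{bot_text}</s>"
    # End with an open bot turn so the model generates the next answer
    prompt += f"<|role_start|>human<|role_end|>{current_input}"
    prompt += "<|role_start|>bot<|role_end|>"
    return prompt
```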
|
|
|
<br> |
|
|
|
## Quickstart |
|
|
|
```bash |
|
git clone https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git |
|
``` |
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
```python |
|
import os |
|
import torch |
|
import time |
|
from transformers import AutoTokenizer |
|
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig |
|
|
|
os.environ["TOKENIZERS_PARALLELISM"] = "false" |
|
|
|
def load_model_tokenizer(model_name_or_local_path): |
|
""" |
|
Load model and tokenizer based on the given model name or local path of the downloaded model. |
|
""" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name_or_local_path, |
|
trust_remote_code=True, |
|
use_fast=False, |
|
                                              legacy=False)
|
tokenizer.padding_side = "left" |
|
|
|
model = AutoGPTQForCausalLM.from_quantized(model_name_or_local_path, |
|
inject_fused_attention=False, |
|
inject_fused_mlp=False, |
|
use_cuda_fp16=True, |
|
disable_exllama=False, |
|
device_map='auto' # Support multi-gpus |
|
) |
|
return model, tokenizer |
|
|
|
|
|
def inference(model, tokenizer, prompt): |
|
""" |
|
    Use the given model and tokenizer to generate an answer for the specified prompt.
|
""" |
|
st = time.time() |
|
prompt = prompt if prompt.endswith('\n') else f'{prompt}\n' |
|
inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>" |
|
|
|
input_ids = tokenizer.encode(inputs, |
|
return_tensors="pt", |
|
padding=True, |
|
add_special_tokens=False).to("cuda") |
|
with torch.no_grad(): |
|
generated_ids = model.generate( |
|
input_ids=input_ids, |
|
top_p=0.95, |
|
temperature=0.1, |
|
do_sample=True, |
|
max_new_tokens=512, |
|
eos_token_id=tokenizer.eos_token_id, |
|
pad_token_id=tokenizer.pad_token_id |
|
) |
|
print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}') |
|
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) |
|
print(f'generate text is {outputs[0][len(inputs): ]}') |
|
latency = time.time() - st |
|
print('latency is {} seconds'.format(latency)) |
|
|
|
|
|
if __name__ == "__main__": |
|
    model_name_or_local_path = '<Model name (i.e. codefuse-ai/CodeFuse-CodeLlama-34B-4bits) or local path of the downloaded model>'
|
prompt = 'Please write a QuickSort program in Python' |
|
|
|
model, tokenizer = load_model_tokenizer(model_name_or_local_path) |
|
inference(model, tokenizer, prompt) |
|
``` |
|
|
|
**The current inference example code is based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). If you want to achieve higher inference speed, it is recommended to combine it with [TensorRT-LLM (Early Access)](https://developer.nvidia.com/tensorrt-llm-early-access).** |
|
|
|
<br> |
|
|
|
## Consistency Check |
|
SHA256 values are provided below for the model-related files so that you can verify their consistency after download.
|
|
|
| File | SHA256 | |
|
|-------------------------------:|:--------------------------------:| |
|
|config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b | |
|
|generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 | |
|
|gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a | |
|
|pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d | |
|
|quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 | |
|
|special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 | |
|
|tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 | |
|
|tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 | |
|
|tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 | |
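
A quick way to verify the downloaded files against this table on Linux (use `shasum -a 256` instead on macOS), run from inside the cloned model directory:

```bash
# Print SHA256 digests and compare them with the table above
sha256sum config.json generation_config.json gptq_model-4bit-64g.bin \
          tokenizer.json tokenizer.model tokenizer_config.json
```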
|
|
|
|
|
<br> |
|
<br> |
|
|
|
|
|
<a id="chinese"></a> |
|
|
|
## Model Description
|
|
|
CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a large code model obtained by fine-tuning the base model CodeLlama-34b-Python on multiple code tasks with QLoRA; the model input length is 4K.
|
|
|
After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB VRAM) or an RTX 4090 (24GB VRAM), and the quantized model still achieves 73.8% on the HumanEval pass@1 metric.
|
|
|
<br> |
|
|
|
## News
|
|
|
🔥🔥🔥 2023-09-26 The 4-bit quantized version of CodeFuse-CodeLlama-34B is released; the quantized model achieves 73.8% on the HumanEval pass@1 metric (greedy decoding).
|
|
|
🔥🔥🔥 2023-09-11 CodeFuse-CodeLlama-34B is released, reaching 74.4% pass@1 on HumanEval (greedy decoding), the current open-source SOTA.
|
|
|
<br> |
|
|
|
## Code Community
|
**Homepage**: 🏡 https://github.com/codefuse-ai (**Please give us your support with a Star🌟 + Fork🚀 + Watch👀**)
|
|
|
+ If you wish to fine-tune the model yourself, you can visit ✨[MFTCoder](https://github.com/codefuse-ai/MFTCoder)✨✨
|
|
|
+ If you wish to deploy the model yourself, you can visit ✨[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse)✨✨
|
|
|
+ If you wish to see a demo of the model, you can visit ✨[CodeFuse Demo](https://github.com/codefuse-ai/codefuse)✨✨
|
|
|
<br> |
|
|
|
## Performance (Code)
|
|
|
|
|
| Model                            | HumanEval(pass@1) | Date    |
|
|:--------------------------------|:-----------------:|:-------:| |
|
| **CodeFuse-CodeLlama-34B** | **74.4%** | 2023.9 | |
|
|**CodeFuse-CodeLlama-34B-4bits** | **73.8%** | 2023.9 | |
|
| WizardCoder-Python-34B-V1.0 | 73.2% | 2023.8 | |
|
| GPT-4(zero-shot) | 67.0% | 2023.3 | |
|
| PanGu-Coder2 15B | 61.6% | 2023.8 | |
|
| CodeLlama-34b-Python | 53.7% | 2023.8 | |
|
| CodeLlama-34b | 48.8% | 2023.8 | |
|
| GPT-3.5(zero-shot) | 48.1% | 2022.11 | |
|
| OctoCoder | 46.2% | 2023.8 | |
|
| StarCoder-15B | 33.6% | 2023.5 | |
|
| LLaMA 2 70B(zero-shot) | 29.9% | 2023.7 | |
|
<br> |
|
|
|
## GPU Memory Usage
|
We measured the GPU memory usage after loading the model, as well as the memory usage when the input is 2048/1024 tokens and the output is 1024/2048 tokens, as shown in the table below.
|
|
|
| Precision                        | Idle Model          | Input 2048 tokens + Output 1024 tokens | Input 1024 tokens + Output 2048 tokens |
|
|:--------------------------------|:-------------------|:------------------------:|:------------:| |
|
|bfloat16 | 64.89GB | 69.31GB | 66.41GB | |
|
|int4 | 19.09GB | 22.19GB | 20.78GB | |
|
|
|
<br> |
|
|
|
## Requirements
|
|
|
* python>=3.8 |
|
* pytorch>=2.0.0 |
|
* transformers==4.32.0 |
|
* auto_gptq==0.4.2 |
|
* Sentencepiece |
|
* CUDA 11.4 |
|
|
|
<br> |
|
|
|
## Inference Data Format
|
|
|
The inference data is a string concatenated in the model's training data format; it is also how the input prompt should be concatenated at inference time:
|
|
|
```python |
|
""" |
|
<|role_start|>human<|role_end|>Human 1st round input |
|
<|role_start|>bot<|role_end|>Bot 1st round output</s> |
|
<|role_start|>human<|role_end|>Human 2nd round input |
|
<|role_start|>bot<|role_end|>Bot 2nd round output</s> |
|
... |
|
... |
|
... |
|
<|role_start|>human<|role_end|>Human nth round input

<|role_start|>bot<|role_end|>{Bot output to be generated}</s>
|
""" |
|
``` |
|
|
|
At inference time, please make sure the concatenated prompt string ends with "<|role_start|>bot<|role_end|>" to guide the model to generate an answer.
|
|
|
<br> |
|
|
|
## Quickstart
|
|
|
```bash |
|
git clone https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git |
|
``` |
|
|
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
|
|
```python |
|
import os |
|
import torch |
|
import time |
|
from transformers import AutoTokenizer |
|
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig |
|
|
|
os.environ["TOKENIZERS_PARALLELISM"] = "false" |
|
|
|
def load_model_tokenizer(model_name_or_local_path): |
|
""" |
|
Load model and tokenizer based on the given model name or local path of downloaded model. |
|
""" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name_or_local_path, |
|
trust_remote_code=True, |
|
use_fast=False, |
|
                                              legacy=False)
|
tokenizer.padding_side = "left" |
|
|
|
model = AutoGPTQForCausalLM.from_quantized(model_name_or_local_path, |
|
inject_fused_attention=False, |
|
inject_fused_mlp=False, |
|
use_cuda_fp16=True, |
|
disable_exllama=False, |
|
device_map='auto' # Support multi-gpus |
|
) |
|
return model, tokenizer |
|
|
|
|
|
def inference(model, tokenizer, prompt): |
|
""" |
|
    Use the given model and tokenizer to generate an answer for the specified prompt.
|
""" |
|
st = time.time() |
|
prompt = prompt if prompt.endswith('\n') else f'{prompt}\n' |
|
inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>" |
|
|
|
input_ids = tokenizer.encode(inputs, |
|
return_tensors="pt", |
|
padding=True, |
|
add_special_tokens=False).to("cuda") |
|
with torch.no_grad(): |
|
generated_ids = model.generate( |
|
input_ids=input_ids, |
|
top_p=0.95, |
|
temperature=0.1, |
|
do_sample=True, |
|
max_new_tokens=512, |
|
eos_token_id=tokenizer.eos_token_id, |
|
pad_token_id=tokenizer.pad_token_id |
|
) |
|
print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}') |
|
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) |
|
print(f'generate text is {outputs[0][len(inputs): ]}') |
|
latency = time.time() - st |
|
print('latency is {} seconds'.format(latency)) |
|
|
|
|
|
if __name__ == "__main__": |
|
    model_name_or_local_path = '<Model name (i.e. codefuse-ai/CodeFuse-CodeLlama-34B-4bits) or local path of the downloaded model>'
|
    prompt = 'Please write a QuickSort program in Python'
|
|
|
model, tokenizer = load_model_tokenizer(model_name_or_local_path) |
|
inference(model, tokenizer, prompt) |
|
``` |
|
|
|
**The current inference example code is based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). If you want to achieve higher inference speed, it is recommended to combine it with [TensorRT-LLM (Early Access)](https://developer.nvidia.com/tensorrt-llm-early-access).**
|
|
|
<br> |
|
|
|
## Consistency Check
|
SHA256 values of the model-related files are provided here so that you can verify their consistency after download.
|
|
|
| File                            | SHA256                           |
|
|-------------------------------:|:--------------------------------:| |
|
|config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b | |
|
|generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 | |
|
|gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a | |
|
|pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d | |
|
|quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 | |
|
|special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 | |
|
|tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 | |
|
|tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 | |
|
|tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 | |
|
|
|
|