# GPTQ 4bit Inference
FastChat supports GPTQ 4bit inference with [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
1. Windows users: use the `old-cuda` branch.
2. Linux users: the `fastest-inference-4bit` branch is recommended.
## Install
Set up the environment:
```bash
# cd /path/to/FastChat
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git repositories/GPTQ-for-LLaMa
cd repositories/GPTQ-for-LLaMa
# Windows users should use the `old-cuda` branch
git switch fastest-inference-4bit
# Install `quant-cuda` package in FastChat's virtualenv
python3 setup_cuda.py install
pip3 install texttable
```
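After the install finishes, a quick sanity check can confirm that the CUDA extension built correctly. This is a minimal sketch, assuming the kernels are registered under the name `quant_cuda` by `setup_cuda.py`; adjust the import if your branch installs a different package name.
```bash
# Optional: verify the CUDA kernels are importable from FastChat's virtualenv.
# `quant_cuda` is assumed to be the extension name installed by setup_cuda.py.
python3 -c "import quant_cuda; print('quant_cuda kernels OK')"
```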
Chat with the CLI:
```bash
python3 -m fastchat.serve.cli \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--gptq-wbits 4 \
--gptq-groupsize 128
```
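The CLI shares its GPTQ options with the model worker, so you can also point it at a specific checkpoint file if auto-detection fails. The snippet below is a sketch: it assumes `--gptq-ckpt` behaves the same way for the CLI as for the worker, and that the checkpoint was downloaded to the path shown (see the model worker section below).
```bash
# Point --gptq-ckpt directly at the .safetensors file if auto-detection fails
# (the path is an assumption; match it to your downloaded checkpoint).
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-ckpt models/vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.safetensors \
    --gptq-wbits 4 \
    --gptq-groupsize 128
```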
Start the model worker:
```bash
# Download quantized model from huggingface
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g
python3 -m fastchat.serve.model_worker \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--gptq-wbits 4 \
--gptq-groupsize 128
# You can specify which quantized model to use
python3 -m fastchat.serve.model_worker \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--gptq-ckpt models/vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.safetensors \
--gptq-wbits 4 \
--gptq-groupsize 128 \
--gptq-act-order
```
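The model worker registers itself with a FastChat controller, so a full serving stack also needs the controller (and optionally a web UI) running. The following is a minimal sketch using FastChat's standard modules and default local ports; exact flags may differ across FastChat versions.
```bash
# Minimal serving stack sketch (run each command in its own terminal, or background it).
# Module names are FastChat's standard ones; defaults assume localhost ports.
python3 -m fastchat.serve.controller &

python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128 &

# Optional: web UI for chatting with the registered worker
python3 -m fastchat.serve.gradio_web_server
```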
## Benchmark
| LLaMA-13B | branch                 | bits | group size | memory (MiB) | PPL (C4) | median (s/token) | act-order | speedup |
| --------- | ---------------------- | ---- | ---------- | ------------ | -------- | ---------------- | --------- | ------- |
| FP16 | fastest-inference-4bit | 16 | - | 26634 | 6.96 | 0.0383 | - | 1x |
| GPTQ | triton | 4 | 128 | 8590 | 6.97 | 0.0551 | - | 0.69x |
| GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 6.97 | 0.0429 | true | 0.89x |
| GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 7.03 | 0.0287 | false | 1.33x |
| GPTQ | fastest-inference-4bit | 4 | -1 | 8448 | 7.12 | 0.0284 | false | 1.44x |