GPTQ 4bit Inference

FastChat supports GPTQ 4-bit inference with GPTQ-for-LLaMa (https://github.com/qwopqwop200/GPTQ-for-LLaMa).

  1. Windows users: use the `old-cuda` branch.
  2. Linux users: the `fastest-inference-4bit` branch is recommended.

Install

Setup environment:

# cd /path/to/FastChat
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git repositories/GPTQ-for-LLaMa
cd repositories/GPTQ-for-LLaMa
# Windows users should use the `old-cuda` branch
git switch fastest-inference-4bit
# Install the `quant-cuda` package in FastChat's virtualenv
python3 setup_cuda.py install
pip3 install texttable
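
If the build succeeded, a quick import check from FastChat's virtualenv should pass. This is only a minimal sanity check, assuming `setup_cuda.py` installs the extension under the module name `quant_cuda`:

# Verify that the CUDA quantization kernels are importable
python3 -c "import quant_cuda; print('quant_cuda extension loaded')"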

Chat with the CLI:

python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128

Start model worker:

# Download quantized model from huggingface
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128

# You can also specify which quantized checkpoint file to load
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-ckpt models/vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.safetensors \
    --gptq-wbits 4 \
    --gptq-groupsize 128 \
    --gptq-act-order
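
The model worker registers itself with a FastChat controller, so the controller should be running before the worker starts. A minimal serving sketch, assuming the standard FastChat controller / worker / web-server workflow (the model name below is derived from the model path used above):

# Start the controller first (the model worker registers with it)
python3 -m fastchat.serve.controller

# Once the GPTQ model worker above is running, send a quick test message
python3 -m fastchat.serve.test_message --model-name vicuna-7B-1.1-GPTQ-4bit-128g

# Or serve the web UI on top of the controller and worker
python3 -m fastchat.serve.gradio_web_server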

Benchmark

| LLaMA-13B | branch | Bits | group-size | Memory (MiB) | PPL (C4) | Median (s/token) | act-order | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FP16 | fastest-inference-4bit | 16 | - | 26634 | 6.96 | 0.0383 | - | 1x |
| GPTQ | triton | 4 | 128 | 8590 | 6.97 | 0.0551 | - | 0.69x |
| GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 6.97 | 0.0429 | true | 0.89x |
| GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 7.03 | 0.0287 | false | 1.33x |
| GPTQ | fastest-inference-4bit | 4 | -1 | 8448 | 7.12 | 0.0284 | false | 1.44x |
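
Perplexity and latency numbers like these are typically produced with GPTQ-for-LLaMa's own benchmark mode. A rough sketch of such a run, adapted from the upstream README; the exact flags may differ between branches, and the model and checkpoint paths are placeholders:

# Hypothetical benchmark invocation; paths are placeholders and flags may vary by branch
cd repositories/GPTQ-for-LLaMa
python3 llama.py /path/to/llama-13b-hf c4 \
    --wbits 4 \
    --groupsize 128 \
    --load /path/to/llama13b-4bit-128g.safetensors \
    --benchmark 2048 --check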