👋 Join us on Twitter, Discord and WeChat

W4A16 LLM Model Deployment

LMDeploy supports inference for LLMs with 4-bit weights. The minimum requirement is an NVIDIA GPU with compute capability sm80.
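As a quick pre-flight check for the sm80 requirement, the sketch below compares a compute capability string against 8.0. The helper name meets_sm80 is hypothetical, and the commented usage assumes a driver recent enough that nvidia-smi supports the compute_cap query field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) when a compute capability
# string such as "8.9" is at least 8.0, i.e. sm80 or newer.
meets_sm80() {
  local cap="$1"
  local major="${cap%%.*}"   # text before the first dot, e.g. "8" from "8.9"
  [ "$major" -ge 8 ]
}

# Usage on a machine with an NVIDIA driver (assumption: nvidia-smi
# supports the compute_cap query field):
#   cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
#   meets_sm80 "$cap" && echo "GPU is supported" || echo "GPU is below sm80"
```

Only the major version is compared, which is enough here because the threshold is exactly 8.0.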

Before running inference, make sure that lmdeploy (>=v0.0.14) is installed.

pip install 'lmdeploy>=0.0.14'

4-bit LLM model Inference

You can download the pre-quantized 4-bit weight models from LMDeploy's model zoo and conduct inference using the following command.

Alternatively, you can quantize 16-bit weights to 4-bit weights following the "4-bit Weight Quantization" section, and then perform inference as per the below instructions.

Take the 4-bit Llama-2-70B model from the model zoo as an example:

git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-70b-4bit

As shown in the commands below, first convert the model's layout with lmdeploy convert, and then you can interact with the AI assistant in the terminal.


# Convert the model's layout and store it in the default path, ./workspace.
# --model-path points at the directory created by the git clone above.
lmdeploy convert \
    --model-name llama2 \
    --model-path ./llama2-chat-70b-4bit \
    --model-format awq \
    --group-size 128

# Run inference in the terminal.
lmdeploy chat ./workspace

Serve with gradio

If you wish to interact with the model via a web UI, start the gradio server as shown below:

lmdeploy serve gradio ./workspace --server_name {ip_addr} --server_port {port}

You can then open http://{ip_addr}:{port} in your browser and interact with the model.

Inference Performance

We benchmarked Llama 2 7B and 13B with 4-bit quantization on an NVIDIA GeForce RTX 4090 using profile_generation.py. We measured the token generation throughput (tokens/s) with a single prompt token and 512 generated tokens. All results are for single-batch inference.

Token generation throughput (tokens/s):

model         llm-awq   mlc-llm   turbomind
Llama 2 7B    112.9     159.4     206.4
Llama 2 13B   N/A       90.7      115.8

pip install nvidia-ml-py
python profile_generation.py \
 --model-path /path/to/your/model \
 --concurrency 1 8 --prompt-tokens 0 512 --completion-tokens 2048 512
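To put the throughput table above in relative terms, the following sketch computes the ratio of turbomind's throughput to each baseline. The figures are copied from the table; the speedup helper is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: ratio of two throughput figures, to two decimal places.
# $1 = turbomind throughput (tokens/s), $2 = baseline throughput (tokens/s)
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Figures (tokens/s) copied from the benchmark table.
echo "7B  vs llm-awq: $(speedup 206.4 112.9)x"   # prints 1.83x
echo "7B  vs mlc-llm: $(speedup 206.4 159.4)x"   # prints 1.29x
echo "13B vs mlc-llm: $(speedup 115.8 90.7)x"    # prints 1.28x
```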

4-bit Weight Quantization

Quantization takes two steps:

  • Generate the quantization parameters
  • Quantize the model according to those parameters

Step 1: Generate Quantization Parameter

# --calib_dataset: calibration dataset; supports c4, ptb, wikitext2, pileval
# --calib_samples: number of samples in the calibration set; reduce it if memory is insufficient
# --calib_seqlen:  length of a single piece of text; reduce it if memory is insufficient
# --work_dir:      folder storing PyTorch-format quantization statistics and post-quantization weights
lmdeploy lite calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR

Step 2: Quantize Weights

LMDeploy employs the AWQ algorithm for model weight quantization.

# --w_bits:       number of bits for weight quantization
# --w_sym:        whether to use symmetric quantization for weights
# --w_group_size: group size for weight quantization statistics
# --work_dir:     directory holding the quantization parameters from Step 1
lmdeploy lite auto_awq \
  --model $HF_MODEL \
  --w_bits 4 \
  --w_sym False \
  --w_group_size 128 \
  --work_dir $WORK_DIR
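For a rough sense of what 4-bit weights save, the arithmetic below estimates weight storage for a model with a nominal 70 billion parameters, such as the Llama-2-70B example. It is a back-of-the-envelope sketch only: the parameter count is an assumed round number, and the estimate ignores per-group scales and zero points, the KV cache, and activations.

```shell
# Assumption: a nominal 70e9 parameters for a "70B" model.
params=70000000000
fp16_bytes=$(( params * 2 ))   # 16-bit weights: 2 bytes per parameter
w4_bytes=$(( params / 2 ))     # 4-bit weights: 0.5 bytes per parameter

# Report in decimal gigabytes (1 GB = 10^9 bytes).
echo "16-bit weights: $(( fp16_bytes / 1000000000 )) GB"   # prints 140 GB
echo "4-bit weights:  $(( w4_bytes / 1000000000 )) GB"     # prints 35 GB
```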

After quantization is complete, the quantized model is saved to $WORK_DIR. You can then run inference by following the instructions in the "4-bit LLM model Inference" section.
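The two quantization steps above can be chained in a small wrapper so that weight quantization only runs if calibration succeeds. This is a minimal sketch: the function name quantize_w4 and the placeholder paths are hypothetical, while the lmdeploy flags are the ones used in Step 1 and Step 2.

```shell
# Hypothetical helper: run calibration, then AWQ quantization.
# $1 = path or name of the Hugging Face model, $2 = work directory
quantize_w4() {
  local hf_model="$1" work_dir="$2"

  # Step 1: generate quantization parameters; stop here on failure.
  lmdeploy lite calibrate \
    --model "$hf_model" \
    --calib_dataset 'c4' \
    --calib_samples 128 \
    --calib_seqlen 2048 \
    --work_dir "$work_dir" || return 1

  # Step 2: quantize the weights using the parameters from Step 1.
  lmdeploy lite auto_awq \
    --model "$hf_model" \
    --w_bits 4 \
    --w_sym False \
    --w_group_size 128 \
    --work_dir "$work_dir"
}

# Usage (paths are placeholders):
#   quantize_w4 /path/to/hf_model ./quantized_model
```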
