DeepSeek R1 AWQ

AWQ of DeepSeek R1.

This quantized model modifies some of the original model code to fix an overflow issue that occurs when running in float16.

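The exact change lives in this repo's modeling code, but the general idea is illustrated below (a minimal sketch, not the shipped patch; the helper name `clamp_fp16_overflow` is hypothetical): activations that exceed the float16 range are clamped before they turn into `inf` and poison later layers.

```python
import torch

def clamp_fp16_overflow(hidden_states: torch.Tensor) -> torch.Tensor:
    """Clamp activations into the representable float16 range.

    Hypothetical helper showing the kind of change needed when a model
    is served in float16: values above float16's max (~65504) become
    inf and propagate NaNs through later layers.
    """
    if hidden_states.dtype == torch.float16:
        clamp_value = torch.finfo(torch.float16).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```
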
To serve using vLLM with 8x 80GB GPUs, use the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --quantization moe_wna16 \
    --gpu-memory-utilization 0.97 \
    --kv-cache-dtype fp8_e5m2 \
    --calculate-kv-scales \
    --served-model-name deepseek-reasoner \
    --model cognitivecomputations/DeepSeek-R1-AWQ
```

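Once the server is up, you can query it through vLLM's OpenAI-compatible API. A minimal sketch with the `openai` Python client (the port `12345` and model name `deepseek-reasoner` come from the flags above; the prompt is just an example):

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:12345/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # matches --served-model-name
    messages=[{"role": "user", "content": "How many r's are in the word strawberry?"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
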
The --max-model-len flag keeps KV cache usage from exceeding available GPU memory. The moe_wna16 kernel doubles the inference speed, but as of 2025/2/3 you must build vLLM from source to use it.
You can download the wheel I built for PyTorch 2.6 and Python 3.12 by clicking here.

Inference speed with batch size 1 and a short prompt:

  • 8x H100: 34 TPS
  • 8x A100: 27 TPS