triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 163840, Hardware limit: 101376. Reducing block sizes or `num_stages` may help
I deployed the model on 2 Ubuntu servers, each with 8×A40 (48G), using a Ray cluster (not Docker). When I send a request to the server, it prints the error above and crashes.
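For context on what the error message's advice means: in a pipelined Triton GEMM kernel, shared-memory use scales roughly with the tile sizes and `num_stages`. The sketch below is a rough back-of-envelope estimate (a hypothetical helper, not vLLM/Triton API; the exact formula depends on the kernel), assuming one A tile plus one B tile per pipeline stage. With fp16 tiles of BLOCK_M = BLOCK_N = 128 and BLOCK_K = 64 (assumed values), 5 stages reproduces the "Required: 163840" figure, while 3 stages fits under the 101376-byte hardware limit reported for the A40.

```python
# Hypothetical back-of-envelope estimate of shared memory for a pipelined
# Triton GEMM kernel: one A tile (BLOCK_M x BLOCK_K) plus one B tile
# (BLOCK_K x BLOCK_N) buffered per pipeline stage.
def estimate_smem(block_m, block_n, block_k, dtype_bytes, num_stages):
    tile_bytes = (block_m * block_k + block_k * block_n) * dtype_bytes
    return tile_bytes * num_stages

# Assumed tile shape; fp16 => 2 bytes per element.
# 5 stages: (128*64 + 64*128) * 2 * 5 = 163840 bytes > 101376 limit
print(estimate_smem(128, 128, 64, 2, 5))  # 163840
# 3 stages: 98304 bytes, which fits under the 101376-byte limit
print(estimate_smem(128, 128, 64, 2, 3))  # 98304
```

This is only to illustrate why reducing block sizes or `num_stages` shrinks the "Required" number; the actual kernel config is chosen by the library, not by a vLLM flag.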
The vLLM command (run on the Ray cluster):
vllm serve "/data/model/hub/cognitivecomputations/DeepSeek-R1-awq/" --served-model-name deepseekr1 --port 8989 --trust_remote_code --tensor-parallel-size 8 -pp 2 --enable-prefix-caching --enable-chunked-prefill --calculate-kv-scales --kv-cache-dtype fp8_e5m2 --quantization moe_wna16 --gpu-memory-utilization 0.8 --max_model_len 4096
Please tell me how to configure this. Thank you!
I did not test this on more than 8 GPUs; please file this issue with vLLM.