triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 163840, Hardware limit: 101376. Reducing block sizes or `num_stages` may help
I deployed the model on 2 Ubuntu servers, each with 8×A40 (48G), using a Ray cluster (not Docker). When I send a request to the server, it prints the error above and crashes.
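For context on what the error message's advice means: in a pipelined Triton GEMM kernel, shared-memory use scales roughly with the tile sizes and `num_stages`. The sketch below is a rough back-of-envelope estimate (a hypothetical helper, not vLLM/Triton API; the exact formula depends on the kernel), assuming one A tile plus one B tile per pipeline stage. With fp16 tiles of BLOCK_M = BLOCK_N = 128 and BLOCK_K = 64 (assumed values), 5 stages reproduces the "Required: 163840" figure, while 3 stages fits under the 101376-byte hardware limit reported for the A40.

```python
# Hypothetical back-of-envelope estimate of shared memory for a pipelined
# Triton GEMM kernel: one A tile (BLOCK_M x BLOCK_K) plus one B tile
# (BLOCK_K x BLOCK_N) buffered per pipeline stage.
def estimate_smem(block_m, block_n, block_k, dtype_bytes, num_stages):
    tile_bytes = (block_m * block_k + block_k * block_n) * dtype_bytes
    return tile_bytes * num_stages

# Assumed tile shape; fp16 => 2 bytes per element.
# 5 stages: (128*64 + 64*128) * 2 * 5 = 163840 bytes > 101376 limit
print(estimate_smem(128, 128, 64, 2, 5))  # 163840
# 3 stages: 98304 bytes, which fits under the 101376-byte limit
print(estimate_smem(128, 128, 64, 2, 3))  # 98304
```

This is only to illustrate why reducing block sizes or `num_stages` shrinks the "Required" number; the actual kernel config is chosen by the library, not by a vLLM flag.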
The vLLM command (run on the Ray cluster):
vllm serve "/data/model/hub/cognitivecomputations/DeepSeek-R1-awq/" --served-model-name deepseekr1 --port 8989 --trust_remote_code --tensor-parallel-size 8 -pp 2 --enable-prefix-caching --enable-chunked-prefill --calculate-kv-scales --kv-cache-dtype fp8_e5m2 --quantization moe_wna16 --gpu-memory-utilization 0.8 --max_model_len 4096
Please tell me how to configure this. Thank you!
I did not test this on more than 8 GPUs; please file this issue with vLLM.