Significant Speed Drop with Increasing Input Length on H800 GPUs
#17
by wangkkk956 · opened
Here are my test results with a single concurrent request. I found that the output speed varies quite a bit depending on input length.
| Input Length (tokens) | Decoding Speed (tokens/s) |
|---|---|
| 20 | 46 |
| 2k | 40 |
| 4k | 36 |
| 8k | 11 |
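For reference, the decoding speed can be estimated against the OpenAI-compatible endpoint roughly like this. This is only a sketch, assuming the server launched below is listening on localhost:12345 with the served model name deepseek-reasoner, and that curl, jq, bc, and python3 are available; the repeated-word prompt is just an approximate way to hit a target token count.

```bash
#!/usr/bin/env bash
# Rough decoding-speed estimate against the OpenAI-compatible server launched
# below (assumes localhost:12345 and served model name deepseek-reasoner; the
# repeated-word prompt is only an approximate token count).
URL=http://localhost:12345/v1/chat/completions
PROMPT=$(python3 -c "print('hello ' * 8000)")   # crude ~8k-token prompt

run() {  # run <max_tokens>  ->  prints "<elapsed_seconds> <completion_tokens>"
  local body start end resp
  body=$(jq -n --arg p "$PROMPT" --argjson m "$1" \
    '{model: "deepseek-reasoner", messages: [{role: "user", content: $p}],
      max_tokens: $m, temperature: 0}')
  start=$(date +%s.%N)
  resp=$(curl -s "$URL" -H "Content-Type: application/json" -d "$body")
  end=$(date +%s.%N)
  echo "$(echo "$end - $start" | bc -l) $(echo "$resp" | jq '.usage.completion_tokens')"
}

read -r t1 _  < <(run 1)     # ~prefill time plus one generated token
read -r t2 n2 < <(run 256)   # prefill plus 256 decoded tokens
# Subtracting the max_tokens=1 run removes most of the prefill time, leaving
# an approximate pure decoding rate.
echo "decoding speed ≈ $(echo "($n2 - 1) / ($t2 - $t1)" | bc -l) tokens/s"
```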
I'd like to know how to address this issue. Can it be resolved through parameter adjustments?
I'm running this on a server with eight H800 GPUs.
I used the recommended launch command:
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 12345 \
--max-model-len 65536 \
--max-num-batched-tokens 65536 \
--trust-remote-code \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.97 \
--dtype float16 \
--served-model-name deepseek-reasoner \
--model cognitivecomputations/DeepSeek-R1-AWQ
Add `--max-seq-len-to-capture 65536`. You can also delete `--dtype float16` if you are using Hopper GPUs.
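For reference, a sketch of the adjusted launch command with those suggestions applied, keeping everything else exactly as in the question. `--max-seq-len-to-capture` sets the longest sequence length that vLLM captures into CUDA graphs; sequences beyond it fall back to the slower eager path, which lines up with the sharp drop at 8k inputs.

```bash
# Adjusted launch command (same settings as in the question, with
# --max-seq-len-to-capture added and --dtype float16 removed on Hopper GPUs).
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --max-num-batched-tokens 65536 \
    --max-seq-len-to-capture 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.97 \
    --served-model-name deepseek-reasoner \
    --model cognitivecomputations/DeepSeek-R1-AWQ
```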
v2ray changed discussion status to closed