Significant Speed Drop with Increasing Input Length on H800 GPUs

#17
by wangkkk956 - opened

Here are my test results with a single concurrent request. I found that the decoding speed drops significantly as the input length grows (a rough way to reproduce a measurement is sketched after the table).

Input Length (tokens)    Decoding Speed (tokens/s)
20                       46
2k                       40
4k                       36
8k                       11
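
For reference, one way to reproduce a single data point against the endpoint started by the launch command below (a rough sketch, not my exact script): send one request and divide the reported completion_tokens by the elapsed time. prompt.txt is a placeholder for a file holding an input of the desired length, max_tokens is arbitrary, and the wall-clock time also includes prefill, so the true decoding speed is somewhat higher than this estimate.

# Build the request body from a prompt file (requires jq 1.6+ for --rawfile).
jq -n --rawfile p prompt.txt \
   '{model: "deepseek-reasoner", messages: [{role: "user", content: $p}], max_tokens: 512, temperature: 0}' \
   > request.json

# Time a single non-streaming request and print the reported token usage.
time curl -s http://localhost:12345/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d @request.json | jq '.usage'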

I'd like to know how to address this issue. Can it be resolved through parameter adjustments?
I'm running this on a server with eight H800 GPUs.

I used the recommended launch command:

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --max-num-batched-tokens 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.97 \
    --dtype float16 \
    --served-model-name deepseek-reasoner \
    --model cognitivecomputations/DeepSeek-R1-AWQ
Cognitive Computations org

Add --max-seq-len-to-capture 65536. You can also delete --dtype float16 if you are using Hopper GPUs.
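
With that change, the launch command would look like this (keeping the rest of your flags, and dropping --dtype float16 since the H800 is a Hopper GPU):

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --max-num-batched-tokens 65536 \
    --max-seq-len-to-capture 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.97 \
    --served-model-name deepseek-reasoner \
    --model cognitivecomputations/DeepSeek-R1-AWQ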

v2ray changed discussion status to closed
