Significant Speed Drop with Increasing Input Length on H800 GPUs

#17
by wangkkk956 - opened

Here are my test results with a single concurrent request. I found that the decoding speed drops significantly as the input length grows (a rough way to reproduce a measurement is sketched after the table).

Input Length (tokens)    Decoding Speed (tokens/s)
20                       46
2k                       40
4k                       36
8k                       11
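
For reference, one way to reproduce a single data point against the endpoint started by the launch command below (a rough sketch, not my exact script): send one request and divide the reported completion_tokens by the elapsed time. prompt.txt is a placeholder for a file holding an input of the desired length, max_tokens is arbitrary, and the wall-clock time also includes prefill, so the true decoding speed is somewhat higher than this estimate.

# Build the request body from a prompt file (requires jq 1.6+ for --rawfile).
jq -n --rawfile p prompt.txt \
   '{model: "deepseek-reasoner", messages: [{role: "user", content: $p}], max_tokens: 512, temperature: 0}' \
   > request.json

# Time a single non-streaming request and print the reported token usage.
time curl -s http://localhost:12345/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d @request.json | jq '.usage'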

I'd like to know how to address this issue. Can it be resolved through parameter adjustments?
I'm running this on a server with eight H800 GPUs.

I used the recommended launch command:

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --max-num-batched-tokens 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.97 \
    --dtype float16 \
    --served-model-name deepseek-reasoner \
    --model cognitivecomputations/DeepSeek-R1-AWQ
Cognitive Computations org

Add --max-seq-len-to-capture 65536. You can also delete --dtype float16 if you are using Hopper GPUs.
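
With that change, the launch command would look like this (keeping the rest of your flags, and dropping --dtype float16 since the H800 is a Hopper GPU):

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --max-num-batched-tokens 65536 \
    --max-seq-len-to-capture 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.97 \
    --served-model-name deepseek-reasoner \
    --model cognitivecomputations/DeepSeek-R1-AWQ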

v2ray changed discussion status to closed
