When I use vLLM v0.7.2 to deploy R1 AWQ, I get empty content

#10
by bupalinyu - opened

curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-reasoner",
       "messages": [
         {"role": "user", "content": "你是谁"}
       ],
       "stream": true,
       "temperature": 1.2
      }'
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization moe_wna16 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e5m2 --calculate-kv-scales \
  --served-model-name deepseek-reasoner \
  --model ${LLM_MODEL_DIR}

Same error here. If you set "skip_special_tokens" to false in the sampling params, you'll find the content isn't actually empty: it's repeated <|begin_of_sentence|> tokens. And if you request logprobs, the server raises an error because of NaN values.
Hoping someone can help...
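
You can confirm this from the API as well. A minimal sketch, assuming your vLLM build accepts "skip_special_tokens" as an extra field on the OpenAI-compatible chat endpoint (it is a vLLM extension, not part of the OpenAI spec):

curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-reasoner",
       "messages": [{"role": "user", "content": "你是谁"}],
       "skip_special_tokens": false,
       "max_tokens": 32
      }'

With special tokens no longer stripped, the "empty" completions show up as runs of <|begin_of_sentence|>.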

Please disable kv cache quantization
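
Concretely, that means dropping the "--kv-cache-dtype fp8_e5m2 --calculate-kv-scales" flags from the launch command above, e.g.:

python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization moe_wna16 \
  --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner \
  --model ${LLM_MODEL_DIR}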

Tried that, but it's still the same bug.

Cognitive Computations org

Try building from source.
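
For anyone unsure what that involves, a minimal sketch of the usual vLLM source build (the repo URL is the standard upstream one; pin whatever branch or commit you need):

git clone https://github.com/vllm-project/vllm.git
cd vllm
# editable install; compiles the CUDA kernels, so it can take a while
pip install -e .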

I use SGLang to deploy R1 AWQ on one node with 8× A800 and get the same empty content for some questions too.
My command is below:
python3 -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model-path models/DeepSeek-R1-AWQ \
  --tp 8 \
  --enable-p2p-check \
  --trust-remote-code \
  --dtype float16 \
  --mem-fraction-static 0.9 \
  --served-model-name deepseek-r1-awq \
  --disable-cuda-graph

So, has anyone deployed it successfully?

Cognitive Computations org

This might be related to the float16 overflow issue; please try the moe_wna16 kernel with bfloat16.
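
Applied to the vLLM command from earlier in this thread, that would look something like the sketch below (--dtype bfloat16 is a standard vLLM flag, but whether it fixes this is exactly what's being suggested to test):

python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization moe_wna16 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner \
  --model ${LLM_MODEL_DIR}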

I deployed it successfully with vLLM 0.7.2 on 2 × 8 A100 (40G). But for any Chinese query there is no thinking stage and the reply is very simplistic.

Try downloading https://huggingface.co/deepseek-ai/DeepSeek-R1/tokenizer_config.json and replacing your DeepSeek-R1-awq/tokenizer_config.json with it.
If that works, you may still run into the problem that the model's output is missing the "<think>" label.
DeepSeek-R1's documentation advises: "To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with '<think>\n' at the beginning of every output."
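
One way to enforce that prefix at request time is to pre-seed the assistant turn with "<think>\n" and let the model continue it. A minimal sketch, assuming your vLLM version supports the "add_generation_prompt" / "continue_final_message" extension fields on the chat endpoint (check your version's docs before relying on them):

curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-reasoner",
       "messages": [
         {"role": "user", "content": "你是谁"},
         {"role": "assistant", "content": "<think>\n"}
       ],
       "add_generation_prompt": false,
       "continue_final_message": true
      }'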
