Serving on vLLM creates nonsense responses

#12
by cahmetcan - opened

I used this command to run vLLM. dtype is half because an error occurred earlier saying that my GPU doesn't support float16.
!HUGGING_FACE_HUB_TOKEN=token vllm serve "google/gemma-3-1b-it" --dtype=half

As you can see in the screenshot, it produces nonsense outputs.
[screenshot: garbled model output]

Your GPU told you it doesn't support float16, and then you forced it to use float16, which is exactly what half means: half precision, i.e. 16-bit floats instead of the base 32-bit float32.
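
If you want to confirm what your card actually supports before picking a dtype, a quick check with PyTorch (assuming it is installed in the same environment as vLLM) looks like this:

python -c "import torch; print(torch.cuda.get_device_capability(), torch.cuda.is_bf16_supported())"

A compute capability of (8, 0) or higher (Ampere and newer) means the GPU has native bfloat16 support.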

From the vLLM documentation:

--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.

“auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
“half” for FP16. Recommended for AWQ quantization.
“float16” is the same as “half”.
“bfloat16” for a balance between precision and range.
“float” is shorthand for FP32 precision.
“float32” for FP32 precision.

https://docs.vllm.ai/en/v0.4.0.post1/models/engine_args.html
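
Gemma checkpoints are distributed in bfloat16, and float16 has a much narrower dynamic range, so forcing half can overflow activations and produce garbage. If your GPU has no bfloat16 support, one workaround worth trying is full float32; it is slower and needs roughly twice the memory, but it is numerically safe (token is your placeholder from above):

HUGGING_FACE_HUB_TOKEN=token vllm serve "google/gemma-3-1b-it" --dtype=float32

On an Ampere-or-newer GPU, letting vLLM pick the checkpoint's native dtype is simplest:

HUGGING_FACE_HUB_TOKEN=token vllm serve "google/gemma-3-1b-it" --dtype=auto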
