Serving on vLLM creates nonsense responses
#12 by cahmetcan
Your GPU told you it doesn't support float16, and then you forced it to use float16 anyway: "half" is exactly that, half precision (16 bits, half the width of the base float32).
From the vLLM documentation:
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.
- “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- “half” for FP16. Recommended for AWQ quantization.
- “float16” is the same as “half”.
- “bfloat16” for a balance between precision and range.
- “float” is shorthand for FP32 precision.
- “float32” for FP32 precision.
https://docs.vllm.ai/en/v0.4.0.post1/models/engine_args.html
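The straightforward fix is to stop forcing half and pass a dtype your GPU actually supports, or let auto decide. A minimal sketch for the v0.4.x OpenAI-compatible server; the model name here is only a placeholder, substitute your own:

```bash
# Launch vLLM's OpenAI-compatible server with an explicit dtype.
# facebook/opt-125m is a placeholder model; swap in the one you are serving.
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --dtype auto        # or float32 if the GPU has no usable 16-bit support
```

Keep in mind that float32 doubles the weight memory compared to the 16-bit dtypes, so a model that fit in half may no longer fit in FP32.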