Generating through HuggingFace Transformers leads to RuntimeError: probability tensor contains either `inf`, `nan` or element < 0. Generating through vLLM encounters no issues.

#2
by paulhager - opened

When running the code given on the model card to load and generate through the HuggingFace Transformers library, I encounter `RuntimeError: probability tensor contains either inf, nan or element < 0`.
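For reference, a minimal sketch of the loading and generation path that triggers the error (the model ID is a placeholder for the checkpoint named on the model card; the rest follows the standard Transformers pattern rather than quoting the model card verbatim):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ID; in practice this points at the checkpoint from the model card.
model_id = "org/model-name"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling fails here with:
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```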

When loading and serving the model through vLLM with the exact same model shards, no errors are encountered.
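For comparison, a sketch of the vLLM path that works fine (same placeholder model ID, pointing at the same local shards):

```python
from vllm import LLM, SamplingParams

# Same placeholder model ID / local path as above; no errors here.
llm = LLM(model="org/model-name", dtype="auto")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```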

What could be the problem here? `torch_dtype="auto"` is set in `AutoModelForCausalLM.from_pretrained`, and manually casting with `model = model.bfloat16()` also has no effect.

I encounter the same behavior with the 72B GPTQ Int4 model.

I'm using an A40 GPU.
