Inference problems in transformers==4.50.0

#3
by mirekphd - opened

After upgrading transformers to 4.50.0, I started getting the following inference errors during regression tests of the AWQ version of this VL model (not the standard LLM, whose AWQ quant still works fine):

Some weights of the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct-AWQ were not used when initializing Qwen2_5_VLForConditionalGeneration: ['lm_head.weight']
- This IS expected if you are initializing Qwen2_5_VLForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Qwen2_5_VLForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Qwen2_5_VLForConditionalGeneration were not initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct-AWQ and are newly initialized: ['lm_head.qweight', 'lm_head.qzeros', 'lm_head.scales']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/home/jovyan/work/test-qwen2.5-vl-7b.py", line 56, in <module>
    generated_ids = model.generate(**inputs, max_new_tokens=128)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 2326, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 3332, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
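For context, the device-side assert comes from `torch.multinomial` rejecting a probability row that contains `inf`/`nan` values (here presumably produced by the randomly re-initialized `lm_head.qweight`/`lm_head.qzeros`/`lm_head.scales`). The same validation can be reproduced on CPU, where it raises a plain `RuntimeError` instead of an asynchronous CUDA assert:

```python
import torch

# A probability row containing NaN, as can happen when logits come from
# randomly initialized (i.e. never-loaded) lm_head weights.
probs = torch.tensor([[float("nan"), 0.5, 0.5]])

try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    # CPU raises the same "probability tensor contains either `inf`, `nan`
    # or element < 0" check that fires as a device-side assert on CUDA.
    print(f"RuntimeError: {e}")
```

This is why rerunning the failing script on CPU (or with `CUDA_LAUNCH_BLOCKING=1`, as the error message suggests) gives a stack trace that points at the real failure site.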

A quick workaround is to downgrade transformers to the previous release, e.g.:
pip install "transformers<4.50.0"
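If pinning the package isn't an option, a regression-test suite could instead skip the AWQ VL tests on affected versions. A minimal sketch, using a hypothetical `is_affected` helper and assuming (per this report) that 4.50.0 and later are affected:

```python
def parse_version(v: str) -> tuple:
    """Parse an 'X.Y.Z' version string into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split(".")[:3])

def is_affected(transformers_version: str) -> bool:
    """True for transformers releases with the AWQ VL regression.

    Assumption from this report: 4.50.0 and later are affected.
    """
    return parse_version(transformers_version) >= parse_version("4.50.0")

print(is_affected("4.50.0"))  # True
print(is_affected("4.49.2"))  # False
```

In practice you would pass `transformers.__version__` and wrap the check in `pytest.mark.skipif` or similar.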
