CUDA error: device-side assert triggered

#41
by arshiasoori - opened

I'm encountering a CUDA error when trying to quantize a model using BitsAndBytesConfig with 4-bit settings. Here's the error:

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Note: I have already tried `os.environ['CUDA_LAUNCH_BLOCKING'] = '1'`, but it made no difference.
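
For reference, this is roughly what I tried (as far as I know the variable only takes effect if it is set before CUDA is initialized, so exporting it in a cell after torch has already touched the GPU may simply do nothing):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA is initialized

import torch  # import torch (and anything else that initializes CUDA) only afterwards

# Alternatively, export it when launching the process, e.g.
#   CUDA_LAUNCH_BLOCKING=1 python my_script.py   (my_script.py is a placeholder name)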


Quantization Setup

import torch
from transformers import BitsAndBytesConfig

llm_quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

Model Loading

from transformers import Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(
    llm_model_id,
    cache_dir=CACHE_DIR,
    device_map="auto",
    low_cpu_mem_usage=True,
    use_safetensors=True,
    quantization_config=llm_quantization_config,
)

Environment Details

transformers: 4.50.3
CUDA Version: 12.4
GPU Driver Version: 550.144.03

Additional Notes

  • When running in CPU-only mode, the notebook cell stops executing without any visible error or traceback. It just silently halts.
  • I'm wondering if this might be due to a device assertion related to the model or quantization setup.

Any advice on how to debug or resolve this would be greatly appreciated!
Could this be related to the model weights / compatibility with quantization?


Solved!


Do not use torch.float16 for torch_dtype; use torch.float32 instead.

    def _initialize_model(self):
        """Initialize the quantized LLM"""
        quantization_config = BitsAndBytesConfig(load_in_4bit=True)
        
        model_name = "google/gemma-3-4b-it"

        model = AutoModelForCausalLM.from_pretrained(
            model_name,
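            # torch.float16 here triggered the device-side assert on this setup; float32 works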
            torch_dtype=torch.float32,
            device_map="cuda",
            cache_dir=CACHE_DIR,
            quantization_config=quantization_config,
            # attn_implementation="flash_attention_2"
        )
        tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=CACHE_DIR)
        return model, tokenizer
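
A rough usage sketch (assuming the method above lives on an object called engine, which is just a placeholder name, and that CACHE_DIR is defined):

model, tokenizer = engine._initialize_model()

prompt = "Explain 4-bit quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))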