CUDA error: device-side assert triggered

#41
by arshiasoori - opened

I'm encountering a CUDA error when trying to quantize a model using BitsAndBytesConfig with 4-bit settings. Here's the error:

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Note: I have already tried `os.environ['CUDA_LAUNCH_BLOCKING'] = '1'`, but it made no difference.
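
For reference, this is roughly what I tried (as far as I know the variable only takes effect if it is set before CUDA is initialized, so exporting it in a cell after torch has already touched the GPU may simply do nothing):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA is initialized

import torch  # import torch (and anything else that initializes CUDA) only afterwards

# Alternatively, export it when launching the process, e.g.
#   CUDA_LAUNCH_BLOCKING=1 python my_script.py   (my_script.py is a placeholder name)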


Quantization Setup

import torch
from transformers import BitsAndBytesConfig

llm_quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

Model Loading

from transformers import Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(
    llm_model_id,
    cache_dir=CACHE_DIR,
    device_map="auto",
    low_cpu_mem_usage=True,
    use_safetensors=True,
    quantization_config=llm_quantization_config,
)

Environment Details

transformers: 4.50.3
CUDA Version: 12.4
GPU Driver Version: 550.144.03

Additional Notes

  • When running in CPU-only mode, the notebook cell stops executing without any visible error or traceback. It just silently halts.
  • I'm wondering if this might be due to a device assertion related to the model or quantization setup.

Any advice on how to debug or resolve this would be greatly appreciated!
Could this be related to the model weights / compatibility with quantization?


Solved!


Do not use torch.float16 for torch_dtype; use torch.float32 instead.

    def _initialize_model(self):
        """Initialize the quantized LLM"""
        quantization_config = BitsAndBytesConfig(load_in_4bit=True)
        
        model_name = "google/gemma-3-4b-it"

        model = AutoModelForCausalLM.from_pretrained(
            model_name,
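            # torch.float16 here triggered the device-side assert on this setup; float32 works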
            torch_dtype=torch.float32,
            device_map="cuda",
            cache_dir=CACHE_DIR,
            quantization_config=quantization_config,
            # attn_implementation="flash_attention_2"
        )
        tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=CACHE_DIR)
        return model, tokenizer
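
A rough usage sketch (assuming the method above lives on an object called engine, which is just a placeholder name, and that CACHE_DIR is defined):

model, tokenizer = engine._initialize_model()

prompt = "Explain 4-bit quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))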