CUDA error: device-side assert triggered
#41
by arshiasoori - opened
I'm encountering a CUDA error when trying to quantize a model using BitsAndBytesConfig
with 4-bit settings. Here's the error:
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Note: I have already tried
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
but it made no difference.
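As far as I know, CUDA_LAUNCH_BLOCKING is only read when the CUDA context is initialized, so setting it from a notebook cell after torch has already touched the GPU has no effect. A minimal sketch of setting it early enough (after restarting the kernel, in the very first cell):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA initializes

import torch  # import torch only after the variable is set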
Quantization Setup
import torch
from transformers import BitsAndBytesConfig

llm_quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
Model Loading
from transformers import Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(
    llm_model_id,  # model id and cache dir are defined elsewhere in the notebook
    cache_dir=CACHE_DIR,
    device_map="auto",
    low_cpu_mem_usage=True,
    use_safetensors=True,
    quantization_config=llm_quantization_config,
)
Environment Details
- transformers: 4.50.3
- CUDA Version: 12.4
- GPU Driver Version: 550.144.03
Additional Notes
- When running in CPU-only mode, the notebook cell stops executing without any visible error or traceback. It just silently halts.
- I'm wondering if this might be due to a device assertion related to the model or quantization setup.
Any advice on how to debug or resolve this would be greatly appreciated!
Could this be related to the model weights / compatibility with quantization?
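One check that might help narrow this down: device-side asserts are often triggered by an out-of-range index in an embedding lookup, e.g. a token ID greater than or equal to the embedding table size. A minimal sketch, assuming the tokenizer and model are already loaded (the prompt is just a placeholder):

# Sketch: verify no input token ID exceeds the model's embedding table.
inputs = tokenizer("test prompt", return_tensors="pt")
vocab_size = model.get_input_embeddings().num_embeddings
assert int(inputs["input_ids"].max()) < vocab_size, "token ID out of embedding range"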
Solved!
Do not use torch.float16 for torch_dtype; use torch.float32 instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def _initialize_model(self):
    """Initialize the quantized LLM."""
    quantization_config = BitsAndBytesConfig(load_in_4bit=True)
    model_name = "google/gemma-3-4b-it"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float32,
        device_map="cuda",
        cache_dir=CACHE_DIR,
        quantization_config=quantization_config,
        # attn_implementation="flash_attention_2"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=CACHE_DIR)
    return model, tokenizer
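A quick smoke test once the model and tokenizer are returned (a sketch; the prompt is just a placeholder):

# Sketch: quick generation check for the quantized model.
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))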