Why doesn't the VRAM go down when I quantize to 4-bit (and other issues)

#33
by andrewqian123

The VRAM usage is still 3 GB when I apply BnB 4-bit quantization:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.float32,  # Use higher precision for computation
)
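For context, this is roughly how the config gets applied when the model is loaded and how I check the footprint. It is a sketch of the standard transformers pattern rather than my exact code, and the model name is a placeholder:

from transformers import AutoModel

# Continues from the bnb_config above; "your-model-name" is a placeholder.
# quantization_config is what tells from_pretrained to load the weights in 4-bit.
model = AutoModel.from_pretrained(
    "your-model-name",
    quantization_config=bnb_config,
    device_map="auto",
)

print(model.get_memory_footprint() / 1e9)   # reported size of the loaded weights, in GB
print(torch.cuda.memory_allocated() / 1e9)  # VRAM currently allocated by PyTorch, in GB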

Furthermore, all of the outputs from the last hidden state:

last_hidden_state = self.model(**input_data)[0]  # first element of the output is the last hidden state
print(last_hidden_state)

are NaN (every single value). When I quantize to 8-bit there are still NaNs, but not every value is NaN. It seems quantization causes the NaN outputs, and only the unquantized model gives the same output as the original (the quick check I use is below). Can someone explain or help with this issue? (Preferably both.)
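A minimal sketch of that check, assuming last_hidden_state from the snippet above; the comments describe what I observe in each case:

import torch

nan_mask = torch.isnan(last_hidden_state)
print(nan_mask.all().item())           # 4-bit: True, every element is NaN
print(nan_mask.float().mean().item())  # 8-bit: some but not all values are NaN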
