QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU

Community Article · Published October 4, 2024


We can fine-tune large language models (LLMs) on consumer hardware thanks to QLoRA. This parameter-efficient fine-tuning method quantizes the model's parameters, freezes them, and then fine-tunes an adapter on top of the model.

QLoRA was originally proposed by the author of the bitsandbytes quantization framework. bitsandbytes quantization performs very well, thanks to the NormalFloat4 (NF4) data type, and most of the QLoRA code you will find online relies on it. However, bitsandbytes has several limitations: it can’t quantize to a precision lower than 4-bit, and it makes the model significantly slower, as we saw in this article:

The Best Quantization Methods to Run Llama 3.1 on Your GPU

Moreover, since QLoRA has been proposed, several better quantization methods have been published. For instance, we now have HQQ, AQLM, AutoRound, and AWQ.

With Hugging Face PEFT, it is possible to use these quantization methods for QLoRA instead of bitsandbytes but their impact on fine-tuning performance is understudied.

In this article, we will experiment and compare HQQ, AQLM, AutoRound, bitsandbytes, and GPTQ for QLoRA fine-tuning. We will see how fast each method is for fine-tuning and how well it performs with QLoRA. All the code examples presented in this article use Llama 3.1, but they would work the same for other LLMs supported by these quantization methods.

You can find the code for fine-tuning LLMs (e.g., Llama 3.1) quantized with HQQ, AQLM, AutoRound, bitsandbytes, and GPTQ in this notebook:

Get the notebook (#96)

Fine-tuning Quantized LLMs

Since QLoRA was first implemented with bitsandbytes and is mostly used with bitsandbytes quantization, “QLoRA fine-tuning” often implies that bitsandbytes is used. However, rather than a particular implementation, QLoRA is a fine-tuning method: LoRA fine-tuning with a quantized LLM.

If we fine-tune an adapter on top of a model quantized with GPTQ, this is still a QLoRA fine-tuning, even if we don’t use bitsandbytes.

In the experiments for this article, I only changed the quantization method applied to the model; all the other hyperparameters remain the same. This means that we only need to modify how we load the model.

With bitsandbytes, we define a quantization configuration when we load the model. If we want to use other quantization methods, we only have to load the quantized model instead (see some examples in the following sections). Hugging Face Transformers will automatically detect the quantization algorithm. Note: Transformers must support the quantization method.

The code to fine-tune a quantized Llama 3.1 is the same as what we would use for QLoRA with bitsandbytes, except for the lines loading the model.

Llama 3.1: Fine-tuning on Consumer Hardware — LoRA vs. QLoRA

You can find the quantized models I made for this article in this Hugging Face collection:

Fine-tuning AutoRound, AQLM, GPTQ, and AWQ Models

Take your QLoRA code, and replace these lines:

# compute_dtype and attn_implementation are defined earlier in your QLoRA script
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
)

with:

# No quantization_config is needed: Transformers detects the quantization method from the model's configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map={"": 0}, attn_implementation=attn_implementation
)

“model_name” is your model, already quantized. I quantized Llama 3.1 8B with GPTQ and AutoRound. For AQLM, I used a model that was already available.

The GPTQ and AutoRound models are quantized to 4-bit. The AQLM model is quantized to a lower precision (2-bit), which makes it smaller but not as accurate as the 4-bit models.

Note: In this article, we won’t see the results of fine-tuning AWQ models. I have never been able to make it work: when the AWQ model is prepared for training by the PEFT library, it seems to be automatically converted to a 32-bit model, which triggers out-of-memory errors.
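For the models that do work (GPTQ, AutoRound, AQLM), the rest of the QLoRA code, i.e., the LoRA adapter setup with PEFT, is identical. Here is a minimal sketch, with a placeholder model name and illustrative LoRA hyperparameters rather than the exact ones I used:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Placeholder: any pre-quantized Llama 3.1 checkpoint supported by Transformers
model_name = "my-quantized-llama-3.1-8b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map={"": 0}, torch_dtype=torch.bfloat16
)

# Freeze the quantized weights and prepare the model for adapter training
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA hyperparameters; target modules follow Llama's layer naming
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the adapter parameters are trainable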

Fine-tuning HQQ Models

We have already seen how to do it in this article:

1-bit and 2-bit Llama 3: Quantization with HQQ and Fine-tuning with HQQ+

It works similarly to bitsandbytes quantization. The model is quantized during loading. We have to set a quantization configuration (HqqConfig):

# The model is quantized on the fly while it is loaded
quant_config = HqqConfig(nbits=4, group_size=128, quant_zero=False, quant_scale=False, axis=1)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant_config, torch_dtype=torch.bfloat16, device_map={"": 0}, attn_implementation=attn_implementation
)

I couldn’t make it work with FlashAttention. The attention implementation must be PyTorch’s SDPA (attn_implementation="sdpa").

Learning Curves with Quantized Models

I ran QLoRA fine-tuning with bitsandbytes, GPTQ, HQQ, AQLM, and AutoRound models. For the fine-tuning data, I again used timdettmers/openassistant-guanaco. This is one of my favorite datasets for fine-tuning experiments: it is small but large enough to teach an LLM how to answer instructions, so one training epoch on it is cheap. It’s also multilingual and distributed under an Apache 2.0 license.

The fine-tuning code for each quantization method is in the notebook.
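As a rough sketch, the shared training setup looks like this. The hyperparameters below are illustrative rather than the exact values I used, argument names may differ slightly between trl versions, and `model` and `tokenizer` are the PEFT-wrapped model and tokenizer from the sketch above:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# timdettmers/openassistant-guanaco ships with "train" and "test" splits and a single "text" column
dataset = load_dataset("timdettmers/openassistant-guanaco")

# Illustrative hyperparameters, not the exact ones used in these experiments
training_args = SFTConfig(
    output_dir="./qlora-llama-3.1-8b",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_train_epochs=1,
    logging_steps=25,
    eval_strategy="steps",
    eval_steps=25,
    bf16=True,
    dataset_text_field="text",
    max_seq_length=512,
)

# `model` and `tokenizer` come from the adapter setup sketch above
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()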

[Figures: training and validation loss curves for QLoRA fine-tuning with each quantization method]

Note: A lower loss is better.

First observation: it works. With all 5 quantization methods, the adapter learns from the data (both the training and validation losses decrease).

AutoRound yields the best results and slightly outperforms bitsandbytes in this setting. QLoRA with AutoRound also learns faster (the loss decreases faster at the beginning of training). If we had to make a ranking, it would be:

  1. AutoRound
  2. bitsandbytes
  3. HQQ
  4. GPTQ
  5. AQLM

This is the same ranking that we obtained when we compared the performance of these quantization methods:

The Best Quantization Methods to Run Llama 3.1 on Your GPU

Intuitively, it suggests that a more accurate quantization method will be better for QLoRA fine-tuning.

On the other hand, AQLM yields worse results, as expected, since it uses a lower quantization precision. Use this AQLM model only if you don’t have enough GPU VRAM for fine-tuning.

Faster Fine-tuning than with bitsandbytes

I also checked how long fine-tuning takes with each quantization method. I used Google Colab’s L4 GPU (22.5 GB of VRAM) for this experiment.

[Figure: fine-tuning time with each quantization method on Google Colab’s L4 GPU]

QLoRA with bitsandbytes is significantly slower than with the other quantization methods. AutoRound is as fast as GPTQ since the AutoRound model was serialized with the GPTQ format.

I think it could be even faster (maybe 30% faster) if we used the Marlin kernel for the GPTQ model. I didn’t try it, but it should work.

Marlin: Nearly Ideal Inference Speed for 4-bit Models with vLLM (1k+ tokens/sec)

For faster fine-tuning, you can use RunPod (referral link), which offers a large choice of GPUs.

Conclusion

bitsandbytes is the most used quantization method for QLoRA fine-tuning. However, it doesn’t always yield the best results and it is significantly slower than other quantization methods.

For QLoRA fine-tuning, I recommend AutoRound. It performs at least as well as bitsandbytes while being faster, which makes fine-tuning cheaper. Quantizing models with AutoRound is also easy:

Intel AutoRound: Accurate Low-bit Quantization for LLMs
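As a rough illustration, and assuming the auto-round Python package with its AutoRound/save_quantized API, quantizing Llama 3.1 8B to 4-bit and exporting it in a GPTQ-compatible format can look like this (the settings are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Meta-Llama-3.1-8B"  # gated repo; accept the license first
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights with a group size of 128 (illustrative settings)
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# Exporting in a GPTQ-compatible format lets Transformers load it like a GPTQ model
autoround.save_quantized("./Llama-3.1-8B-AutoRound-4bit", format="auto_gptq")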

If you have a GPU with limited VRAM, use the AQLM 2-bit version. It would be good to have a 4-bit AQLM version, which might perform even better than AutoRound, but I don’t know whether ISTA, the lab behind AQLM, will release one.

HQQ is also a fast alternative, with the advantage that you don’t need an already-serialized HQQ version of the model, since quantization is done at loading time. Note also that HQQ exposes many hyperparameters that you can tune to get faster and better models.

One drawback of fine-tuning adapters on top of quantized models such as GPTQ or AutoRound models is that we can’t merge the adapter into the model. We have to load the adapter every time we want to use it for inference. In theory, adapters fine-tuned with bitsandbytes QLoRA can be merged, but in practice this often results in poor performance, as I discussed here:

Don't Merge Your LoRA Adapter Into a 4-bit LLM
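For reference, here is a minimal sketch of loading such an adapter for inference on top of its quantized base model (the names and paths below are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "my-quantized-llama-3.1-8b"  # placeholder: the quantized base model
adapter_path = "./qlora-llama-3.1-8b"          # placeholder: where the adapter was saved

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, device_map={"": 0}, torch_dtype=torch.bfloat16
)

# The adapter is loaded on top of the frozen quantized model; nothing is merged
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()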