Quantization 4Bits - 4.92 GB GPU memory usage for inference:

** Vide same fine-tuning for Llama2-7B-Chat: https://huggingface.co/nlpulse/llama2-7b-chat-english_quotes

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 37%   70C    P2   163W / 170W |   4923MiB / 12288MiB |     91%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Fine-tuning

3 epochs, all dataset samples (split=train), 939 steps
1 x GPU NVidia RTX 3060 12GB - max. GPU memory: 7.44 GB
Duration: 1h45min

$ nvidia-smi && free -h
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
|100%   89C    P2   166W / 170W |   7439MiB / 12288MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
               total        used        free      shared  buff/cache   available
Mem:            77Gi        14Gi        23Gi        79Mi        39Gi        62Gi
Swap:           37Gi          0B        37Gi

Inference

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_path = "nlpulse/gpt-j-6b-english_quotes"

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token

# quantization config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# model
model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=quant_config, device_map={"":0})

# inference
device = "cuda"
text_list = ["Ask not what your country", "Be the change that", "You only live once, but", "I'm selfish, impatient and"]
for text in text_list:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=60)
    print('>> ', text, " => ", tokenizer.decode(outputs[0], skip_special_tokens=True))

Requirements

pip install -U bitsandbytes
pip install -U git+https://github.com/huggingface/transformers.git 
pip install -U git+https://github.com/huggingface/peft.git
pip install -U accelerate
pip install -U datasets
pip install -U scipy

Scripts

https://github.com/nlpulse-io/sample_codes/tree/main/fine-tuning/peft_quantization_4bits/gptj-6b

References

QLoRa: Fine-Tune a Large Language Model on Your GPU

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

Downloads last month
27
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train nlpulse/gpt-j-6b-english_quotes