CCRss commited on
Commit
ed04990
·
verified ·
1 Parent(s): fd8ec75

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +75 -0
README.md ADDED
@@ -0,0 +1,75 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Simple QLoRA Model Inference
2
+
3
+ This guide demonstrates how to perform inference using a QLoRA (Quantized Low-Rank Adaptation) fine-tuned model with a single code cell.
4
+
5
+ ## Requirements
6
+
7
+ - Python 3.7+
8
+ - PyTorch
9
+ - Transformers
10
+ - PEFT (Parameter-Efficient Fine-Tuning)
11
+ - bitsandbytes
12
+
13
+ Install the required packages:
14
+
15
+ ```
16
+ pip install torch transformers peft bitsandbytes
17
+ ```
18
+
19
+ ## Inference Code
20
+
21
+ Copy and paste the following code into a Python script or Jupyter notebook cell:
22
+
23
+ ```python
24
+ import torch
25
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
26
+ from peft import PeftModel
27
+
28
+ # Set up model paths
29
+ BASE_MODEL_PATH = "meta-llama/Meta-Llama-3.1-8B-Instruct"
30
+ ADAPTER_PATH = "CCRss/Meta-Llama-3.1-8B-Instruct-qlora-nf-ds_oasst1"
31
+
32
+ # Load tokenizer
33
+ tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
34
+ tokenizer.pad_token = tokenizer.eos_token
35
+ tokenizer.padding_side = "right"
36
+
37
+ # Load quantized model with adapter
38
+ bnb_config = BitsAndBytesConfig(
39
+ load_in_4bit=True,
40
+ bnb_4bit_quant_type="nf4",
41
+ bnb_4bit_compute_dtype=torch.float16,
42
+ )
43
+ model = AutoModelForCausalLM.from_pretrained(
44
+ BASE_MODEL_PATH,
45
+ quantization_config=bnb_config,
46
+ device_map="auto"
47
+ )
48
+ model = PeftModel.from_pretrained(model, ADAPTER_PATH)
49
+
50
+ # Generate text
51
+ prompt = "Explain quantum computing in simple terms:"
52
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
53
+ outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
54
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
55
+
56
+ print(generated_text)
57
+ ```
58
+
59
+ ## Usage
60
+
61
+ 1. Replace `BASE_MODEL_PATH` with the path to your base model.
62
+ 2. Replace `ADAPTER_PATH` with the path to your QLoRA adapter.
63
+ 3. Modify the `prompt` variable to use your desired input text.
64
+ 4. Run the code cell.
65
+
66
+ ## Customization
67
+
68
+ - Adjust `max_new_tokens`, `temperature`, and other generation parameters in the `model.generate()` function call to control the output.
69
+
70
+ ## Troubleshooting
71
+
72
+ - If you encounter CUDA out-of-memory errors, try reducing `max_new_tokens` or using a smaller model.
73
+ - Ensure your GPU drivers and CUDA toolkit are up-to-date.
74
+
75
+ For more advanced usage or optimizations, refer to the Hugging Face documentation for Transformers and PEFT.