Update README.md
README.md CHANGED
@@ -43,36 +43,18 @@ This repo contains GPTQ model files for [Technology Innovation Institute's Falco
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

-## EXPERIMENTAL
-
-These are experimental first GPTQs for Falcon 180B. They have not yet been tested.
+## Requirements

Transformers version 4.33.0 is required.

-Once this change has been made, they should be usable just like any other GPTQ model. You can try the example Transformers Python code later in this README, or try loading them directly from AutoGPTQ.
-
-I believe you will need 2 x 80GB GPUs (or 4 x 48GB) to load the 4-bit models, and probably the 3-bit ones as well.
-
-Assuming the quants finish OK (and if you're reading this message, they did!) I will test them during the day on 7th September and update this notice with my findings.
-
-To join it:
-
-Linux and macOS:
-```
-cat model.safetensors-split-* > model.safetensors && rm model.safetensors-split-*
-```
-Windows command line:
-```
-COPY /B model.safetensors.split-a + model.safetensors.split-b model.safetensors
-del model.safetensors.split-a model.safetensors.split-b
-```
+Due to the huge size of the model, the GPTQ has been sharded. This will break compatibility with AutoGPTQ, and therefore any clients/libraries that use AutoGPTQ directly.
+
+But they work great direct from Transformers!
+
+Currently these GPTQs are tested to work with:
+- Transformers 4.33.0
+- Text Generation Inference (TGI) 1.0.2

<!-- description end -->

<!-- repositories-available start -->
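The removed notice above estimated 2 x 80GB GPUs (or 4 x 48GB) for the 4-bit files, and the new Requirements text says the sharded GPTQ works directly from Transformers. The sketch below combines the two by loading the model across multiple GPUs with `device_map="auto"`; the `max_memory` caps are illustrative assumptions for a 2 x 80GB setup, not values from the README.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Falcon-180B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" spreads the sharded GPTQ weights across all visible GPUs;
# max_memory caps per-device usage (illustrative values for 2 x 80GB cards).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "76GiB", 1: "76GiB", "cpu": "96GiB"},
    revision="main",
)
```

If the weights do not fit within the given budget, layers spill over to CPU, which is usually far too slow to be practical for a model this size.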
@@ -159,40 +141,22 @@ It is strongly recommended to use the text-generation-webui one-click-installers
### Install the necessary packages

-Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ
+Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ.

```shell
pip3 install transformers>=4.33.0 optimum>=1.12.0
-pip3
-git clone -b TB_Latest_Falcon https://github.com/TheBloke/AutoGPTQ
-cd AutoGPTQ
-pip3 install .
+pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```

-###
-
-I recommend using my fast download script
-
-```shell
-git clone https://github.com/TheBlokeAI/AIScripts
-python3 AIScripts/hub_download.py TheBloke/Falcon-180B-Chat-GPTQ Falcon-180B-Chat-GPTQ --branch main # change branch if you want to use the 3-bit model instead
-```
-
-### Now join the files
-
-```shell
-cd Falcon-180B-Chat-GPTQ
-# Windows users: see the command to use in the Description at the top of this README
-cat model.safetensors-split-* > model.safetensors && rm model.safetensors-split-*
-```
-
-### And then finally you can run the following code
+### Transformers sample code

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

-model_name_or_path = "/
+model_name_or_path = "TheBloke/Falcon-180B-Chat-GPTQ"
+
+# To use a different branch, change revision
+# For example: revision="gptq-3bit--1g-actorder_True"

model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             revision="main")
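The removed section relied on a custom download script with a `--branch` flag. If you still want to pre-download one branch to disk instead of streaming it from the Hub, `huggingface_hub` covers the same ground; this is a sketch rather than part of the diff, and the `local_dir` name is arbitrary.

```python
from huggingface_hub import snapshot_download

# Download one branch of the repo to a local folder.
# Use revision="gptq-3bit--1g-actorder_True" (for example) for a different branch.
local_path = snapshot_download(
    repo_id="TheBloke/Falcon-180B-Chat-GPTQ",
    revision="main",
    local_dir="Falcon-180B-Chat-GPTQ",
)
print(local_path)
```

The resulting folder can then be passed as `model_name_or_path` in the sample code.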
@@ -206,7 +170,7 @@ Assistant: '''
print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
-output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
+output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
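The only change in this hunk is adding `do_sample=True`. The reason is that `temperature` and `top_p` are sampling parameters, so with greedy decoding (the default, `do_sample=False`) they have no effect. The same settings can be grouped in a `GenerationConfig`, as a rough equivalent; the snippet below assumes the `model` and `input_ids` from the sample code above.

```python
from transformers import GenerationConfig

# temperature/top_p only take effect when do_sample=True; with greedy decoding
# (do_sample=False) they are ignored. Reuses `model` and `input_ids` from the
# sample code above.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
)
output = model.generate(inputs=input_ids, generation_config=gen_config)
```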
@@ -218,6 +182,7 @@ pipe = pipeline(
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
+   do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15
)
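For reference, the pipeline call this last hunk touches looks roughly like the following when assembled end to end. Only the parameters shown in the hunks above come from the README; the task name "text-generation", the `model=model` argument, and the prompt text are assumptions based on standard Transformers pipeline usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Falcon-180B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             revision="main")

# Placeholder prompt in the User/Assistant format referenced by the hunk header above.
prompt_template = "User: Tell me about AI\nAssistant:"

pipe = pipeline(
    "text-generation",   # assumed task name; not shown in this diff
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15,
)
print(pipe(prompt_template)[0]["generated_text"])
```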