Initial GPTQ model commit
- Model creator: [Jon Durbin](https://huggingface.co/jondurbin)
- Original model: [Airoboros c34B 2.1](https://huggingface.co/jondurbin/airoboros-c34b-2.1)

<!-- description start -->
## Description

This repo contains GPTQ model files for [Jon Durbin's Airoboros c34B 2.1](https://huggingface.co/jondurbin/airoboros-c34b-2.1).

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

<!-- description end -->
<!-- repositories-available start -->
## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GGUF)
* [Jon Durbin's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/jondurbin/airoboros-c34b-2.1)
<!-- repositories-available end -->

<!-- prompt-template start -->
## Prompt template: Chat

```
A chat
USER: {prompt}
ASSISTANT:
```
<!-- prompt-template end -->
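The template above can be applied programmatically. A minimal sketch (`format_chat_prompt` is a helper written for this example, not part of the repo):

```python
def format_chat_prompt(user_message: str) -> str:
    """Wrap a user message in the Airoboros 2.1 chat template shown above."""
    return f"A chat\nUSER: {user_message}\nASSISTANT:"

print(format_chat_prompt("Tell me about AI"))
```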

<!-- README_GPTQ.md-provided-files start -->
## Provided files and GPTQ parameters

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

All GPTQ files are made with AutoGPTQ.

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.54 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
| [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 14.14 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |

<!-- README_GPTQ.md-provided-files end -->
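Each branch also ships a `quantize_config.json` recording these parameters. A sketch of inspecting it after download (the local path and the fallback values are assumptions for illustration; the field names `bits`, `group_size`, `desc_act` and `damp_percent` are the ones AutoGPTQ writes):

```python
import json
from pathlib import Path

# Hypothetical location of a downloaded branch; adjust to your own path.
config_path = Path("Airoboros-c34B-2.1-GPTQ") / "quantize_config.json"

# Fallback mirroring the gptq-3bit-128g-actorder_True row above, used only
# when no local file is present.
cfg = {"bits": 3, "group_size": 128, "desc_act": True, "damp_percent": 0.1}
if config_path.exists():
    cfg = json.loads(config_path.read_text())

summary = f"bits={cfg['bits']} GS={cfg['group_size']} Act-Order={cfg['desc_act']}"
print(summary)
```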

<!-- README_GPTQ.md-download-from-branches start -->
## How to download from branches

- In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Airoboros-c34B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
- With Git, you can clone a branch with:

```shell
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ
```

- In Python Transformers code, the branch is the `revision` parameter; see below.
<!-- README_GPTQ.md-download-from-branches end -->
<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Airoboros-c34B-2.1-GPTQ`.
    - To download from a specific branch, enter for example `TheBloke/Airoboros-c34B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
    - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `Airoboros-c34B-2.1-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
    * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
<!-- README_GPTQ.md-text-generation-webui end -->

<!-- README_GPTQ.md-use-from-python start -->
## How to use this GPTQ model from Python code

### Install the necessary packages

Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

```shell
pip3 install transformers>=4.32.0 optimum>=1.12.0
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```

If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
```
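To confirm the environment meets the minimums above before loading the model, you can compare version strings numerically. `meets_minimum` here is a small helper written for this sketch, not a library function:

```python
import importlib.metadata

def meets_minimum(installed: str, required: str) -> bool:
    """Numeric comparison of dotted versions, e.g. '4.33.0' >= '4.32.0'.

    Pre-release suffixes such as '.dev0' are ignored.
    """
    def parts(version: str) -> list:
        out = []
        for piece in version.split("."):
            digits = "".join(ch for ch in piece if ch.isdigit())
            if not digits:
                break
            out.append(int(digits))
        return out
    return parts(installed) >= parts(required)

# Check the three packages named above against their required minimums.
for package, minimum in [("transformers", "4.32.0"), ("optimum", "1.12.0"), ("auto-gptq", "0.4.2")]:
    try:
        version = importlib.metadata.version(package)
        status = "OK" if meets_minimum(version, minimum) else "too old"
        print(package, version, status)
    except importlib.metadata.PackageNotFoundError:
        print(package, "not installed")
```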

### For CodeLlama models only: you must use Transformers 4.33.0 or later

If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:

```shell
pip3 uninstall -y transformers
pip3 install git+https://github.com/huggingface/transformers.git
```

### You can then use the following code

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Airoboros-c34B-2.1-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             torch_dtype=torch.float16,
                                             device_map="auto",
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''A chat
USER: {prompt}
ASSISTANT:
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512
)

print(pipe(prompt_template)[0]['generated_text'])
```
<!-- README_GPTQ.md-use-from-python end -->
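The pipeline returns the prompt and the completion together in `generated_text`. If you only want the assistant's reply, one simple approach is to split on the `ASSISTANT:` marker from the prompt template (`extract_reply` is a sketch written for this example, not repo code):

```python
def extract_reply(generated_text: str) -> str:
    """Return only the text after the last 'ASSISTANT:' marker, if present."""
    _, _, reply = generated_text.rpartition("ASSISTANT:")
    return reply.strip()

example = "A chat\nUSER: Tell me about AI\nASSISTANT: AI is the field of..."
print(extract_reply(example))
```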

<!-- README_GPTQ.md-compatibility start -->
## Compatibility

The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).

[ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

[Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
<!-- README_GPTQ.md-compatibility end -->

<!-- footer start -->
<!-- 200823 -->
- these models just produce text; what you do with that text is your responsibility
- many people and industries deal with "sensitive" content; imagine if a court stenographer's equipment filtered illegal content - it would be useless

Huge thank you to the folks over at [a16z](https://a16z.com/) for sponsoring the costs associated with building models and associated tools!

### Prompt format

The training code was updated to randomize newline vs space: