TheBloke committed on
Commit 269bda6
1 Parent(s): 03beaf0

Initial GPTQ model commit

Files changed (1)
  1. README.md +52 -38
README.md CHANGED
@@ -31,18 +31,23 @@ quantized_by: TheBloke
  - Model creator: [Jon Durbin](https://huggingface.co/jondurbin)
  - Original model: [Airoboros c34B 2.1](https://huggingface.co/jondurbin/airoboros-c34b-2.1)

  ## Description

  This repo contains GPTQ model files for [Jon Durbin's Airoboros c34B 2.1](https://huggingface.co/jondurbin/airoboros-c34b-2.1).

  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

  ## Repositories available

  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ)
  * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GGUF)
  * [Jon Durbin's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/jondurbin/airoboros-c34b-2.1)

  ## Prompt template: Chat

  ```
@@ -52,6 +57,9 @@ ASSISTANT:

  ```

  ## Provided files and GPTQ parameters

  Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
@@ -65,7 +73,7 @@ All GPTQ files are made with AutoGPTQ.

  - Bits: The bit size of the quantised model.
  - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have issues with models that use Act Order plus Group Size.
  - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
  - GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
  - Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
@@ -82,6 +90,9 @@ All GPTQ files are made with AutoGPTQ.
  | [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.54 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
  | [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 14.14 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |

  ## How to download from branches

  - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Airoboros-c34B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
@@ -90,73 +101,72 @@ All GPTQ files are made with AutoGPTQ.
  git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ
  ```
  - In Python Transformers code, the branch is the `revision` parameter; see below.
-
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

- It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.

  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/Airoboros-c34B-2.1-GPTQ`.
  - To download from a specific branch, enter for example `TheBloke/Airoboros-c34B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
- 4. The model will start downloading. Once it's finished it will say "Done"
  5. In the top left, click the refresh icon next to **Model**.
  6. In the **Model** dropdown, choose the model you just downloaded: `Airoboros-c34B-2.1-GPTQ`
  7. The model will automatically load, and is now ready for use!
  8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
  * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
  9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!

  ## How to use this GPTQ model from Python code

- First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) 0.3.1 or later installed:

- ```
- pip3 install auto-gptq
- ```

- If you have problems installing AutoGPTQ, please build from source instead:
  ```
  pip3 uninstall -y auto-gptq
  git clone https://github.com/PanQiWei/AutoGPTQ
  cd AutoGPTQ
  pip3 install .
  ```

- Then try the following example code:

  ```python
- from transformers import AutoTokenizer, pipeline, logging
- from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

  model_name_or_path = "TheBloke/Airoboros-c34B-2.1-GPTQ"
-
- use_triton = False

  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-         use_safetensors=True,
-         trust_remote_code=False,
-         device="cuda:0",
-         use_triton=use_triton,
-         quantize_config=None)
-
- """
- # To download from a specific branch, use the revision parameter, as in this example:
- # Note that `revision` requires AutoGPTQ 0.3.1 or later!
-
- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-         revision="gptq-4bit-32g-actorder_True",
-         use_safetensors=True,
-         trust_remote_code=False,
-         device="cuda:0",
-         quantize_config=None)
- """
-
  prompt = "Tell me about AI"
  prompt_template=f'''A chat
  USER: {prompt}
@@ -172,9 +182,6 @@ print(tokenizer.decode(output[0]))

  # Inference can also be done using transformers' pipeline

- # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
- logging.set_verbosity(logging.CRITICAL)
-
  print("*** Pipeline:")
  pipe = pipeline(
      "text-generation",
@@ -188,12 +195,17 @@ pipe = pipeline(

  print(pipe(prompt_template)[0]['generated_text'])
  ```

  ## Compatibility

- The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.

- ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

  <!-- footer start -->
  <!-- 200823 -->
@@ -254,6 +266,8 @@ This is an instruction fine-tuned llama-2 model, using synthetic data generated
  - these models just produce text, what you do with that text is your responsibility
  - many people and industries deal with "sensitive" content; imagine if a court stenographer's equipment filtered illegal content - it would be useless

  ### Prompt format

  The training code was updated to randomize newline vs space:
 
  - Model creator: [Jon Durbin](https://huggingface.co/jondurbin)
  - Original model: [Airoboros c34B 2.1](https://huggingface.co/jondurbin/airoboros-c34b-2.1)

+ <!-- description start -->
  ## Description

  This repo contains GPTQ model files for [Jon Durbin's Airoboros c34B 2.1](https://huggingface.co/jondurbin/airoboros-c34b-2.1).

  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

+ <!-- description end -->
+ <!-- repositories-available start -->
  ## Repositories available

  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ)
  * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GGUF)
  * [Jon Durbin's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/jondurbin/airoboros-c34b-2.1)
+ <!-- repositories-available end -->

+ <!-- prompt-template start -->
  ## Prompt template: Chat

  ```

  ```

+ <!-- prompt-template end -->
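The template body itself falls in the unchanged lines this diff skips over; from the `ASSISTANT:` fragment in the hunk header and the `prompt_template` f-string in the Python example further down, it appears to follow an `A chat` / `USER:` / `ASSISTANT:` shape. A minimal sketch of filling it in (inferred, not the README's verbatim template):

```python
# Sketch only: the exact template text is elided in this diff, so the shape below
# is inferred from the hunk header ("ASSISTANT:") and the Python example further down.
prompt = "Tell me about AI"
prompt_template = f'''A chat
USER: {prompt}
ASSISTANT:
'''
print(prompt_template)
```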
+
+ <!-- README_GPTQ.md-provided-files start -->
  ## Provided files and GPTQ parameters

  Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

  - Bits: The bit size of the quantised model.
  - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
+ - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
  - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
  - GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
  - Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.

  | [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.54 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
  | [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 14.14 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |

+ <!-- README_GPTQ.md-provided-files end -->
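The Bits, GS, Act Order and Damp % values listed above are also recorded in each branch's `quantize_config.json`, which loaders read automatically (see the text-generation-webui note below). A hedged sketch of fetching and inspecting that file with `huggingface_hub`; the key names shown are the usual AutoGPTQ ones and are an assumption, not copied from this repo:

```python
import json
from huggingface_hub import hf_hub_download

# Sketch: download quantize_config.json from one branch and print the GPTQ
# parameters it records. Key names (bits, group_size, desc_act, damp_percent)
# are the usual AutoGPTQ fields - assumed, not verified against this repo.
config_path = hf_hub_download(
    repo_id="TheBloke/Airoboros-c34B-2.1-GPTQ",
    filename="quantize_config.json",
    revision="gptq-4bit-32g-actorder_True",  # any branch from the table above
)
with open(config_path) as f:
    quantize_config = json.load(f)

for key in ("bits", "group_size", "desc_act", "damp_percent"):
    print(key, "=", quantize_config.get(key))
```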
+
+ <!-- README_GPTQ.md-download-from-branches start -->
  ## How to download from branches

  - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Airoboros-c34B-2.1-GPTQ:gptq-4bit-32g-actorder_True`

  git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GPTQ
  ```
  - In Python Transformers code, the branch is the `revision` parameter; see below.
+ <!-- README_GPTQ.md-download-from-branches end -->
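Besides text-generation-webui and `git clone`, a branch can also be fetched programmatically. A minimal sketch using `huggingface_hub.snapshot_download` with the `revision` argument (standard `huggingface_hub` usage, not an instruction taken from this README):

```python
from huggingface_hub import snapshot_download

# Sketch: fetch the gptq-4bit-32g-actorder_True branch into the local HF cache
# and print where the files ended up.
local_dir = snapshot_download(
    repo_id="TheBloke/Airoboros-c34B-2.1-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
)
print("Model files downloaded to:", local_dir)
```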
+ <!-- README_GPTQ.md-text-generation-webui start -->
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

+ It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/Airoboros-c34B-2.1-GPTQ`.
  - To download from a specific branch, enter for example `TheBloke/Airoboros-c34B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
+ 4. The model will start downloading. Once it's finished it will say "Done".
  5. In the top left, click the refresh icon next to **Model**.
  6. In the **Model** dropdown, choose the model you just downloaded: `Airoboros-c34B-2.1-GPTQ`
  7. The model will automatically load, and is now ready for use!
  8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
  * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
  9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
+ <!-- README_GPTQ.md-text-generation-webui end -->

+ <!-- README_GPTQ.md-use-from-python start -->
  ## How to use this GPTQ model from Python code

+ ### Install the necessary packages

+ Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

+ ```shell
+ pip3 install transformers>=4.32.0 optimum>=1.12.0
+ pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
  ```
+
+ If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
+
+ ```shell
  pip3 uninstall -y auto-gptq
  git clone https://github.com/PanQiWei/AutoGPTQ
  cd AutoGPTQ
  pip3 install .
  ```

+ ### For CodeLlama models only: you must use Transformers 4.33.0 or later.
+
+ If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
+ ```shell
+ pip3 uninstall -y transformers
+ pip3 install git+https://github.com/huggingface/transformers.git
+ ```
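Before running the example code, it can help to confirm that the installed packages actually meet the minimums stated above (Transformers 4.32.0+, or 4.33.0+ for CodeLlama-based models such as this one, Optimum 1.12.0+, AutoGPTQ 0.4.2+). A small hedged check:

```python
from importlib.metadata import version

# Sketch: print installed versions so they can be compared against the minimums
# stated above. Names are the pip distribution names.
for package in ("transformers", "optimum", "auto-gptq"):
    try:
        print(package, version(package))
    except Exception:
        print(package, "is not installed")
```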
+
+ ### You can then use the following code

  ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

  model_name_or_path = "TheBloke/Airoboros-c34B-2.1-GPTQ"
+ # To use a different branch, change revision
+ # For example: revision="gptq-4bit-32g-actorder_True"
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
+                                              torch_dtype=torch.float16,
+                                              device_map="auto",
+                                              revision="main")

  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

  prompt = "Tell me about AI"
  prompt_template=f'''A chat
  USER: {prompt}

  # Inference can also be done using transformers' pipeline

  print("*** Pipeline:")
  pipe = pipeline(
      "text-generation",

  print(pipe(prompt_template)[0]['generated_text'])
  ```
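The diff skips the unchanged lines between the prompt template and the pipeline example, which is where the closing of the template, the direct `model.generate()` call and the `print(tokenizer.decode(output[0]))` referenced in the hunk header live. A hedged sketch of what that omitted step typically looks like (illustrative values, not the README's verbatim code):

```python
# Sketch of the generation step elided by the diff; sampling parameters are
# illustrative assumptions, not values taken from this README.
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.to(model.device)
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))
```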
+ <!-- README_GPTQ.md-use-from-python end -->

+ <!-- README_GPTQ.md-compatibility start -->
  ## Compatibility

+ The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
+
+ [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
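Because TGI serves the model over HTTP, a client query can stay in Python as well. A minimal hedged sketch using `huggingface_hub.InferenceClient` against a locally running TGI endpoint; the URL, prompt and parameters are placeholders, not values from this README:

```python
from huggingface_hub import InferenceClient

# Sketch: query a TGI server that is already serving this GPTQ model.
# Endpoint URL and generation parameters are placeholder assumptions.
client = InferenceClient("http://127.0.0.1:8080")
response = client.text_generation(
    "A chat\nUSER: Tell me about AI\nASSISTANT:",
    max_new_tokens=256,
)
print(response)
```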
+ <!-- README_GPTQ.md-compatibility end -->

  <!-- footer start -->
  <!-- 200823 -->

  - these models just produce text, what you do with that text is your responsibility
  - many people and industries deal with "sensitive" content; imagine if a court stenographer's equipment filtered illegal content - it would be useless

+ Huge thank you to the folks over at [a16z](https://a16z.com/) for sponsoring the costs associated with building models and associated tools!
+
  ### Prompt format

  The training code was updated to randomize newline vs space: