Commit 5e272a1 by TheBloke · 1 Parent(s): 7e83507

Update README.md

  1. README.md +28 -95
README.md CHANGED
@@ -42,7 +42,15 @@ quantized_by: TheBloke
42
  <!-- description start -->
43
  # Description
44
 
45
- This repo contains GPTQ model files for [Mistral AI_'s Mixtral 8X7B Instruct v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
46
 
47
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
48
 
@@ -50,7 +58,7 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
50
  <!-- repositories-available start -->
51
  ## Repositories available
52
 
53
- * [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ)
54
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ)
55
  * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF)
56
  * [Mistral AI_'s original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
@@ -61,28 +69,9 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
61
 
62
  ```
63
  <s>[INST] {prompt} [/INST]
64
-
65
  ```
66
-
67
  <!-- prompt-template end -->
68
 
69
-
70
-
71
- <!-- README_GPTQ.md-compatible clients start -->
72
- ## Known compatible clients / servers
73
-
74
- GPTQ models are currently supported on Linux (NVidia/AMD) and Windows (NVidia only). macOS users: please use GGUF models.
75
-
76
- These GPTQ models are known to work in the following inference servers/webuis.
77
-
78
- - [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
79
- - [KoboldAI United](https://github.com/henk717/koboldai)
80
- - [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui)
81
- - [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
82
-
83
- This may not be a complete list; if you know of others, please let me know!
84
- <!-- README_GPTQ.md-compatible clients end -->
85
-
86
  <!-- README_GPTQ.md-provided-files start -->
87
  ## Provided files, and GPTQ parameters
88
 
@@ -187,6 +176,8 @@ Note that using Git with HF repos is strongly discouraged. It will be much slowe
187
  <!-- README_GPTQ.md-text-generation-webui start -->
188
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
189
 
 
 
190
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
191
 
192
  It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
@@ -210,44 +201,6 @@ It is strongly recommended to use the text-generation-webui one-click-installers
210
 
211
  <!-- README_GPTQ.md-text-generation-webui end -->
212
 
213
- <!-- README_GPTQ.md-use-from-tgi start -->
214
- ## Serving this model from Text Generation Inference (TGI)
215
-
216
- It's recommended to use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
217
-
218
- Example Docker parameters:
219
-
220
- ```shell
221
- --model-id TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
222
- ```
223
-
224
- Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):
225
-
226
- ```shell
227
- pip3 install huggingface-hub
228
- ```
229
-
230
- ```python
231
- from huggingface_hub import InferenceClient
232
-
233
- endpoint_url = "https://your-endpoint-url-here"
234
-
235
- prompt = "Tell me about AI"
236
- prompt_template=f'''<s>[INST] {prompt} [/INST]
237
- '''
238
-
239
- client = InferenceClient(endpoint_url)
240
- response = client.text_generation(prompt,
241
- max_new_tokens=128,
242
- do_sample=True,
243
- temperature=0.7,
244
- top_p=0.95,
245
- top_k=40,
246
- repetition_penalty=1.1)
247
-
248
- print(f"Model output: {response}")
249
- ```
250
- <!-- README_GPTQ.md-use-from-tgi end -->
251
  <!-- README_GPTQ.md-use-from-python start -->
252
  ## Python code example: inference from this GPTQ model
253
 
@@ -276,17 +229,24 @@ pip3 install .
276
  ### Example Python code
277
 
278
  ```python
279
- from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
280
-
281
  model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
282
- # To use a different branch, change revision
283
- # For example: revision="gptq-4bit-128g-actorder_True"
284
- model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
285
- device_map="auto",
286
- trust_remote_code=False,
287
- revision="main")
288
 
289
- tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
290
 
291
  prompt = "Tell me about AI"
292
  prompt_template=f'''<s>[INST] {prompt} [/INST]
@@ -297,36 +257,9 @@ print("\n\n*** Generate:")
297
  input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
298
  output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
299
  print(tokenizer.decode(output[0]))
300
-
301
- # Inference can also be done using transformers' pipeline
302
-
303
- print("*** Pipeline:")
304
- pipe = pipeline(
305
- "text-generation",
306
- model=model,
307
- tokenizer=tokenizer,
308
- max_new_tokens=512,
309
- do_sample=True,
310
- temperature=0.7,
311
- top_p=0.95,
312
- top_k=40,
313
- repetition_penalty=1.1
314
- )
315
-
316
- print(pipe(prompt_template)[0]['generated_text'])
317
  ```
318
  <!-- README_GPTQ.md-use-from-python end -->
319
 
320
- <!-- README_GPTQ.md-compatibility start -->
321
- ## Compatibility
322
-
323
- The files provided are tested to work with Transformers. For non-Mistral models, AutoGPTQ can also be used directly.
324
-
325
- [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama and Mistral models in 4-bit. Please see the Provided Files table above for per-file compatibility.
326
-
327
- For a list of clients/servers, please see "Known compatible clients / servers", above.
328
- <!-- README_GPTQ.md-compatibility end -->
329
-
330
  <!-- footer start -->
331
  <!-- 200823 -->
332
  ## Discord
 
42
  <!-- description start -->
43
  # Description
44
 
45
+ This repo contains **EXPERIMENTAL** GPTQ model files for [Mistral AI_'s Mixtral 8X7B Instruct v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
46
+
47
+ ## Requires AutoGPTQ PR
48
+
49
+ These files were made with, and will currently only work with, this AutoGPTQ PR: https://github.com/LaaZa/AutoGPTQ/tree/Mixtral
50
+
51
+ To test, please build AutoGPTQ from source using that PR.
52
+
53
+ Transformers support has just arrived also via two PRs - and is expected in main Transformers + Optimum tomorrow (Dec 12th).
54
 
55
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
56
 
 
58
  <!-- repositories-available start -->
59
  ## Repositories available
60
 
61
+ * AWQ coming soon
62
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ)
63
  * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF)
64
  * [Mistral AI_'s original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
 
69
 
70
  ```
71
  <s>[INST] {prompt} [/INST]
 
72
  ```
 
73
  <!-- prompt-template end -->
74
 
75
  <!-- README_GPTQ.md-provided-files start -->
76
  ## Provided files, and GPTQ parameters
77
 
 
176
  <!-- README_GPTQ.md-text-generation-webui start -->
177
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
178
 
179
+ **WILL CURRENTLY ONLY WORK WITH AUTOGPTQ LOADER, WITH AUTOGPTQ COMPILED FROM PR LISTED ABOVE**
180
+
181
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
182
 
183
  It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
 
201
 
202
  <!-- README_GPTQ.md-text-generation-webui end -->
203
 
204
  <!-- README_GPTQ.md-use-from-python start -->
205
  ## Python code example: inference from this GPTQ model
206
 
 
229
  ### Example Python code
230
 
231
  ```python
 
 
232
  model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
233
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, GPTQConfig
234
+ from auto_gptq import AutoGPTQForCausalLM
 
 
 
 
235
 
236
+ model_name_or_path = args.model_dir
237
+ # To use a different branch, change revision
238
+ # For example: revision="gptq-4bit-32g-actorder_True"
239
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
240
+ model_basename="model",
241
+ use_safetensors=True,
242
+ trust_remote_code=False,
243
+ device="cuda:0",
244
+ use_triton=False,
245
+ disable_exllama=False,
246
+ disable_exllamav2=True,
247
+ quantize_config=None)
248
+
249
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True, trust_remote_code=False)
250
 
251
  prompt = "Tell me about AI"
252
  prompt_template=f'''<s>[INST] {prompt} [/INST]
 
257
  input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
258
  output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
259
  print(tokenizer.decode(output[0]))
260
  ```
261
  <!-- README_GPTQ.md-use-from-python end -->
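Note on the added example: `model_name_or_path` is first set to the repo ID and then reassigned from `args.model_dir`, and `args` is not defined anywhere in the lines shown, so the snippet would likely fail with a `NameError` if run as-is. Below is a minimal sketch of one way it could be made self-contained, assuming the repo ID is the intended value. `MODEL_ID`, `LOADER_KWARGS`, `build_prompt` and `load_model` are hypothetical names introduced here for illustration; the keyword arguments are copied from the added lines, and loading is still expected to need AutoGPTQ built from the PR named in the Description.

```python
# Sketch only: names below are illustrative, not part of the README.
MODEL_ID = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

# Keyword arguments copied from the added README lines.
LOADER_KWARGS = dict(
    model_basename="model",
    use_safetensors=True,
    trust_remote_code=False,
    device="cuda:0",
    use_triton=False,
    disable_exllama=False,
    disable_exllamav2=True,
    quantize_config=None,
)


def build_prompt(prompt: str) -> str:
    """Wrap a user message in the one-line template from the prompt-template block."""
    return f"<s>[INST] {prompt} [/INST]"


def load_model(model_id: str = MODEL_ID):
    """Load the quantized model and tokenizer.

    Imports are deferred so this module can be imported on a machine
    without AutoGPTQ or a GPU; calling this function still requires both.
    """
    from auto_gptq import AutoGPTQForCausalLM
    from transformers import AutoTokenizer

    model = AutoGPTQForCausalLM.from_quantized(model_id, **LOADER_KWARGS)
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, use_fast=True, trust_remote_code=False
    )
    return model, tokenizer


if __name__ == "__main__":
    # Only the prompt is built here; call load_model() on a suitable GPU host.
    print(build_prompt("Tell me about AI"))
```

To select a different branch, passing `revision=` to `from_quantized` is what the comment in the added lines suggests; that is left out of the sketch because the added lines do not show it being passed.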
262
 
263
  <!-- footer start -->
264
  <!-- 200823 -->
265
  ## Discord