Update README.md
README.md
CHANGED
@@ -1,6 +1,13 @@
 ---
 inference: false
-
+language:
+- en
+datasets:
+- guanaco
+model_hub_library:
+- transformers
+license:
+- apache-2.0
 ---
 
 <!-- header start -->
@@ -26,7 +33,7 @@ It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com
 ## Repositories available
 
 * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/WizardCoder-Guanaco-15B-V1.0-GPTQ)
-* [
+* [4, 5, and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/WizardCoder-Guanaco-15B-V1.0-GGML)
 * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/LoupGarou/WizardCoder-Guanaco-15B-V1.0)
 
 ## Prompt template: Alpaca
@@ -53,9 +60,10 @@ It is strongly recommended to use the text-generation-webui one-click-installers
 5. In the top left, click the refresh icon next to **Model**.
 6. In the **Model** dropdown, choose the model you just downloaded: `WizardCoder-Guanaco-15B-V1.0-GPTQ`
 7. The model will automatically load, and is now ready for use!
-8. If you
+8. If you have problems, make sure that **Loader** is set to **AutoGPTQ**.
+9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
 * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
-
+10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
 
 ## How to use this GPTQ model from Python code
 
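
For readers skimming the diff, the next hunk touches the README's Python example (note the `print(pipe(prompt_template)[0]['generated_text'])` context line in its header). Below is a minimal sketch of that usage pattern with AutoGPTQ and an Alpaca-style prompt; the model basename, prompt text, and generation settings are illustrative assumptions, not values taken from this commit.

```python
# Minimal sketch (assumed details): load the GPTQ model with AutoGPTQ and
# run an Alpaca-style prompt through a transformers text-generation pipeline.
from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/WizardCoder-Guanaco-15B-V1.0-GPTQ"
# Basename of the .safetensors file without the extension
# (assumed from the file listed later in the diff).
model_basename = "wizardcoder-guanaco-15b-v1.0-GPTQ-4bit-128g.no-act.order"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,  # CUDA kernels; the more widely tested path for this model
)

# Alpaca-style prompt template, as named in the README section above.
prompt = "Write a Python function that reverses a string."
prompt_template = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:"""

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
print(pipe(prompt_template)[0]['generated_text'])
```

The pipeline call mirrors the `pipe(prompt_template)` pattern already present in the README's own example.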
@@ -127,14 +135,14 @@ print(pipe(prompt_template)[0]['generated_text'])
 
 This will work with AutoGPTQ and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with Triton mode of recent GPTQ-for-LLaMa. If you have issues, please use AutoGPTQ instead.
 
-
+As this is not a Llama model, it will not be supported by ExLlama.
 
 It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act) to increase compatibility and improve inference speed.
 
 * `wizardcoder-guanaco-15b-v1.0-GPTQ-4bit-128g.no-act.order.safetensors`
 * Works with AutoGPTQ in CUDA or Triton modes.
-* [ExLlama](https://github.com/turboderp/exllama)
-*
+* Does NOT work with [ExLlama](https://github.com/turboderp/exllama).
+* Untested with GPTQ-for-LLaMa.
 * Works with text-generation-webui, including one-click-installers.
 * Parameters: Groupsize = 128. Act Order / desc_act = False.
 
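
The hunk above describes the quantisation parameters (4-bit, group_size 128, desc_act False) that text-generation-webui and AutoGPTQ pick up automatically from `quantize_config.json`. As a rough sketch, the same parameters are expressed like this with AutoGPTQ's config object; the values are assumed from the README text, not read from the repository's actual `quantize_config.json`.

```python
# Sketch of the quantisation parameters described above, expressed via AutoGPTQ.
# Values assumed from the README text (4-bit, group_size 128, desc_act False).
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit GPTQ quantisation
    group_size=128,  # group_size 128, chosen for inference accuracy
    desc_act=False,  # --act-order disabled for compatibility and speed
)
```

Because these fields live in `quantize_config.json`, the webui steps earlier in the diff can state that no GPTQ parameters need to be set manually.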