TheBloke committed on
Commit
de4a7c0
·
1 Parent(s): a128078

Update README.md

Files changed (1)
  1. README.md +28 -4
README.md CHANGED
@@ -40,6 +40,16 @@ Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for pro
40
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ)
41
  * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-70b-hf)
42
 
43
  ## Prompt template: None
44
 
45
  ```
@@ -54,10 +64,10 @@ Each separate quant is in a different branch. See below for instructions on fet
54
 
55
  | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
56
  | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
57
- | main | 4 | None | True | 35.33 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
58
- | gptq-4bit-32g-actorder_True | 4 | 32 | True | Still processing | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
59
- | gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
60
- | gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
61
  | gptq-3bit--1g-actorder_True | 3 | None | True | Still processing | False | AutoGPTQ | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
62
  | gptq-3bit-128g-actorder_False | 3 | 128 | False | Still processing | False | AutoGPTQ | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
63
  | gptq-3bit-128g-actorder_True | 3 | 128 | True | Still processing | False | AutoGPTQ | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
@@ -78,6 +88,15 @@ Please make sure you're using the latest version of [text-generation-webui](http
78
 
79
  It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
80
 
81
  1. Click the **Model tab**.
82
  2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-GPTQ`.
83
  - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True`
@@ -97,6 +116,11 @@ First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) instal
97
 
98
  `GITHUB_ACTIONS=true pip install auto-gptq`
99
 
100
  Then try the following example code:
101
 
102
  ```python
 
40
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ)
41
  * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-70b-hf)
42
 
43
+ ## Required: latest version of Transformers
44
+
45
+ Before trying these GPTQs, please update Transformers to the latest GitHub code:
46
+
47
+ ```
48
+ pip3 install git+https://github.com/huggingface/transformers
49
+ ```
50
+
51
+ If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
52
+
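A quick way to confirm the update took effect in the active Python environment is to check the installed Transformers version. The sketch below is illustrative only: `parse_version`, `transformers_is_recent_enough` and the `ASSUMED_MINIMUM` threshold of 4.31.0 are assumptions, not part of the README, which only asks for the latest GitHub code.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: 4.31.0 is used as the cut-off for Llama 2 70B support.
# The README itself only says "latest GitHub code".
ASSUMED_MINIMUM = (4, 31, 0)


def parse_version(text):
    """Turn a string such as '4.32.0.dev0' into (4, 32, 0)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '4.31' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def transformers_is_recent_enough(minimum=ASSUMED_MINIMUM):
    """Return True if Transformers is installed and meets the assumed minimum."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    print("Transformers recent enough:", transformers_is_recent_enough())
```

Run it with the same interpreter that text-generation-webui uses; a `False` result suggests the `pip3 install` above went into a different environment.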
53
  ## Prompt template: None
54
 
55
  ```
 
64
 
65
  | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
66
  | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
67
+ | main | 4 | None | True | 35.33 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
68
+ | gptq-4bit-32g-actorder_True | 4 | 32 | True | Still processing | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
69
+ | gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
70
+ | gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
71
  | gptq-3bit--1g-actorder_True | 3 | None | True | Still processing | False | AutoGPTQ | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
72
  | gptq-3bit-128g-actorder_False | 3 | 128 | False | Still processing | False | AutoGPTQ | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
73
  | gptq-3bit-128g-actorder_True | 3 | 128 | True | Still processing | False | AutoGPTQ | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
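The branch names in the table appear to follow a fixed pattern, with a group size of None written as `-1`. As a minimal sketch under that assumption, the hypothetical helper `branch_name` rebuilds a name from the table columns, and the hypothetical `download_branch` passes it to `snapshot_download` from `huggingface_hub` as the `revision`. Note that the 4-bit, no-group-size quant is the exception: it lives on `main`.

```python
def branch_name(bits, group_size=None, act_order=True):
    """Build a branch name following the pattern seen in the table.

    A group size of None is written as -1, as in 'gptq-3bit--1g-actorder_True'.
    """
    group = -1 if group_size is None else group_size
    return f"gptq-{bits}bit-{group}g-actorder_{act_order}"


def download_branch(branch, repo_id="TheBloke/Llama-2-70B-GPTQ"):
    """Download every file from one branch and return the local folder path.

    Requires `pip install huggingface_hub` and tens of GB of free disk space.
    """
    # Imported lazily so branch_name() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, revision=branch)


if __name__ == "__main__":
    # Only prints the name; call download_branch() explicitly to fetch files.
    print(branch_name(4, 64))
```

Keeping the download in a separate function means the name can be checked against the table before committing to a large transfer.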
 
88
 
89
  It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
90
 
91
+ Note: ExLlama is not currently compatible with Llama 2 70B. Please try GPTQ-for-LLaMa or AutoGPTQ.
92
+
93
+ Remember to update Transformers to the latest GitHub version before trying to use this model:
94
+
95
+ ```
96
+ pip3 install git+https://github.com/huggingface/transformers
97
+ ```
98
+
99
+
100
  1. Click the **Model tab**.
101
  2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-GPTQ`.
102
  - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True`
 
116
 
117
  `GITHUB_ACTIONS=true pip install auto-gptq`
118
 
119
+ Then update Transformers to the latest GitHub code:
120
+ ```
121
+ pip3 install git+https://github.com/huggingface/transformers
122
+ ```
123
+
124
  Then try the following example code:
125
 
126
  ```python