Update README.md
README.md CHANGED
@@ -47,8 +47,8 @@ These 70B Llama 2 GGML files currently only support CPU inference. They are kno
 
 ## Repositories available
 
-* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/StableBeluga2-GPTQ)
-* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/StableBeluga2-GGML)
+* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/StableBeluga2-70B-GPTQ)
+* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/StableBeluga2-70B-GGML)
 * [Stability AI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/stabilityai/StableBeluga2)
 
 ## Prompt template: Orca-Hashes
@@ -94,20 +94,20 @@ Refer to the Provided Files table below to see what files use which methods, and
 
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
-| [stablebeluga2.ggmlv3.q2_K.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q2_K.bin) | q2_K | 2 | 28.59 GB | 31.09 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
-| [stablebeluga2.ggmlv3.q3_K_L.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q3_K_L.bin) | q3_K_L | 3 | 36.15 GB | 38.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| [stablebeluga2.ggmlv3.q3_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q3_K_M.bin) | q3_K_M | 3 | 33.04 GB | 35.54 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| [stablebeluga2.ggmlv3.q3_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q3_K_S.bin) | q3_K_S | 3 | 29.75 GB | 32.25 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
-| [stablebeluga2.ggmlv3.q4_0.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q4_0.bin) | q4_0 | 4 | 38.87 GB | 41.37 GB | Original quant method, 4-bit. |
-| [stablebeluga2.ggmlv3.q4_1.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q4_1.bin) | q4_1 | 4 | 43.17 GB | 45.67 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
-| [stablebeluga2.ggmlv3.q4_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q4_K_M.bin) | q4_K_M | 4 | 41.38 GB | 43.88 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
-| [stablebeluga2.ggmlv3.q4_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q4_K_S.bin) | q4_K_S | 4 | 38.87 GB | 41.37 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
-| [stablebeluga2.ggmlv3.q5_0.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q5_0.bin) | q5_0 | 5 | 47.46 GB | 49.96 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
-| [stablebeluga2.ggmlv3.q5_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q5_K_M.bin) | q5_K_M | 5 | 48.75 GB | 51.25 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
-| [stablebeluga2.ggmlv3.q5_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q5_K_S.bin) | q5_K_S | 5 | 47.46 GB | 49.96 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
-| stablebeluga2.ggmlv3.q5_1.bin | q5_1 | 5 | 51.76 GB | 54.26 GB | Original quant method, 5-bit. Higher accuracy, slower inference than q5_0. |
-| stablebeluga2.ggmlv3.q6_K.bin | q6_K | 6 | 56.59 GB | 59.09 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
-| stablebeluga2.ggmlv3.q8_0.bin | q8_0 | 8 | 73.23 GB | 75.73 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
+| [stablebeluga2-70b.ggmlv3.q2_K.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q2_K.bin) | q2_K | 2 | 28.59 GB | 31.09 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
+| [stablebeluga2-70b.ggmlv3.q3_K_L.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q3_K_L.bin) | q3_K_L | 3 | 36.15 GB | 38.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+| [stablebeluga2-70b.ggmlv3.q3_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q3_K_M.bin) | q3_K_M | 3 | 33.04 GB | 35.54 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+| [stablebeluga2-70b.ggmlv3.q3_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q3_K_S.bin) | q3_K_S | 3 | 29.75 GB | 32.25 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
+| [stablebeluga2-70b.ggmlv3.q4_0.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q4_0.bin) | q4_0 | 4 | 38.87 GB | 41.37 GB | Original quant method, 4-bit. |
+| [stablebeluga2-70b.ggmlv3.q4_1.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q4_1.bin) | q4_1 | 4 | 43.17 GB | 45.67 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
+| [stablebeluga2-70b.ggmlv3.q4_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q4_K_M.bin) | q4_K_M | 4 | 41.38 GB | 43.88 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
+| [stablebeluga2-70b.ggmlv3.q4_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q4_K_S.bin) | q4_K_S | 4 | 38.87 GB | 41.37 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
+| [stablebeluga2-70b.ggmlv3.q5_0.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q5_0.bin) | q5_0 | 5 | 47.46 GB | 49.96 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
+| [stablebeluga2-70b.ggmlv3.q5_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q5_K_M.bin) | q5_K_M | 5 | 48.75 GB | 51.25 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
+| [stablebeluga2-70b.ggmlv3.q5_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q5_K_S.bin) | q5_K_S | 5 | 47.46 GB | 49.96 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
+| stablebeluga2-70b.ggmlv3.q5_1.bin | q5_1 | 5 | 51.76 GB | 54.26 GB | Original quant method, 5-bit. Higher accuracy, slower inference than q5_0. |
+| stablebeluga2-70b.ggmlv3.q6_K.bin | q6_K | 6 | 56.59 GB | 59.09 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
+| stablebeluga2-70b.ggmlv3.q8_0.bin | q8_0 | 8 | 73.23 GB | 75.73 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
 
 ### q5_1, q6_K and q8_0 files require expansion from archive
 
@@ -115,23 +115,23 @@ Refer to the Provided Files table below to see what files use which methods, and
 
 ### q5_1
 Please download:
-* `stablebeluga2.ggmlv3.q5_1.zip`
-* `stablebeluga2.ggmlv3.q5_1.z01`
+* `stablebeluga2-70b.ggmlv3.q5_1.zip`
+* `stablebeluga2-70b.ggmlv3.q5_1.z01`
 
 ### q6_K
 Please download:
-* `stablebeluga2.ggmlv3.q6_K.zip`
-* `stablebeluga2.ggmlv3.q6_K.z01`
+* `stablebeluga2-70b.ggmlv3.q6_K.zip`
+* `stablebeluga2-70b.ggmlv3.q6_K.z01`
 
 ### q8_0
 Please download:
-* `stablebeluga2.ggmlv3.q8_0.zip`
-* `stablebeluga2.ggmlv3.q8_0.z01`
+* `stablebeluga2-70b.ggmlv3.q8_0.zip`
+* `stablebeluga2-70b.ggmlv3.q8_0.z01`
 
 Then extract the .zip archive. This will expand both parts automatically. On Linux I found I had to use `7zip` - the basic `unzip` tool did not work. Example:
 ```
 sudo apt update -y && sudo apt install 7zip
-7zz x stablebeluga2.ggmlv3.q6_K.zip
+7zz x stablebeluga2-70b.ggmlv3.q6_K.zip
 ```
 
 Once the `.bin` is extracted you can delete the `.zip` and `.z01` files.
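
Putting the extraction steps from the updated README together, here is a minimal end-to-end sketch for the q6_K split archive. It assumes both parts have already been downloaded into the current directory and that the `7zip` apt package (which provides the `7zz` binary, as used in the README's example) is available; the size check is based on the table's 56.59 GB figure.

```
# Sketch only: assumes stablebeluga2-70b.ggmlv3.q6_K.zip and .z01 are already
# in the current directory, as listed in the q6_K download section above.
sudo apt update -y && sudo apt install 7zip    # provides the 7zz binary
7zz x stablebeluga2-70b.ggmlv3.q6_K.zip        # picks up the .z01 part automatically
ls -lh stablebeluga2-70b.ggmlv3.q6_K.bin       # should show a file of roughly 56.59 GB
rm stablebeluga2-70b.ggmlv3.q6_K.zip stablebeluga2-70b.ggmlv3.q6_K.z01   # safe to delete once the .bin exists
```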
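
Once a `.bin` is in place, a run might look like the sketch below. This is not taken from the README: the file choice, thread count, context size, and prompt are illustrative, and the flags assume a llama.cpp build from the GGMLv3 era (mid-2023), when Llama 2 70B GGML models also needed `-gqa 8`; check `./main --help` on your build. Per the note in the first hunk, these 70B GGML files are CPU-only, so no GPU-offload flags are used.

```
# Hypothetical invocation of llama.cpp's ./main against one of the renamed files.
# -gqa 8 reflects the grouped-query-attention setting 70B GGML needed on builds of
# that period; the prompt follows the "### User:" / "### Assistant:" style the
# Orca-Hashes heading suggests, but verify against the README's prompt template.
./main -m stablebeluga2-70b.ggmlv3.q4_K_M.bin \
  -t 10 -c 2048 -n 256 -gqa 8 \
  -p "### User: Give three facts about beluga whales. ### Assistant:"
```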