Updated the RAM usage chart for the ggmlv3 models
README.md CHANGED
@@ -12,11 +12,34 @@ datasets:
### This repository contains quantized conversions of EleutherAI's Pythia Deduped checkpoints.
- Converted with ggerganov/ggml's gpt-neox conversion script, and tested with KoboldCpp.
- I can't promise that this will work, especially with other frontends. ~~I've had problems when generating words like "Alice" or "Hakurei" / "Gensokyo". Could be related to the ggml implementation of GPT-NeoX having a "hacked" tokenizer [(source)](https://github.com/ggerganov/ggml/tree/master/examples/gpt-neox#notes).~~ **This seems to have been improved with KoboldCpp v1.25.1 and the ggmlv3 versions of these models.**
**2023-04-20:** *q4_3. Used [commit 05f3079](https://github.com/ggerganov/ggml/tree/05f307971862b83df12fada0c42ee027ba5a82b5/examples/stablelm)*
@@ -26,27 +49,10 @@ Versions:
**2023-05-15:** *New quantization format (ggmlv2). q4_0 and q5_1, up to 2.8B. Used [commit 010203f](https://github.com/ggerganov/ggml/tree/010203f94a85df5c86b773dc5acb698c8e7b1e7b/examples/gpt-neox)*
**2023-05-25:** *New quantization format (ggmlv3). q4_0 and q5_1, up to 2.8B. Used [commit 73ad593](https://github.com/ggerganov/ggml/tree/73ad593cf84f864f0fcfd3a196253575c70d66a2/examples/gpt-neox)*
They're separated by date and commit so it's easier to keep track of any breaking changes.
# RAM USAGE (on KoboldCpp w/ OpenBLAS)
Model | Initial RAM | After generation
:--:|:--:|:--:
Unloaded | 41.3 MiB |
ggml-pythia-70m-deduped-q4_0.bin | 113.3 MiB | 267.8 MiB
ggml-pythia-70m-deduped-q5_1.bin | 121.5 MiB | 129.4 MiB
ggml-pythia-160m-deduped-q4_0.bin | 199.4 MiB | 201.6 MiB
ggml-pythia-160m-deduped-q5_1.bin | 227.5 MiB | 241.0 MiB
ggml-pythia-410m-deduped-q4_0.bin | 399.2 MiB | 406.2 MiB
ggml-pythia-410m-deduped-q5_1.bin | 455.7 MiB | 460.3 MiB
ggml-pythia-1b-deduped-q4_0.bin | 803.0 MiB | 809.0 MiB
ggml-pythia-1b-deduped-q5_1.bin | 921.5 MiB | 927.3 MiB
ggml-pythia-1.4b-deduped-q4_0.bin | 1.1 GiB | 1.1 GiB
ggml-pythia-1.4b-deduped-q5_1.bin | 1.3 GiB | 1.3 GiB
ggml-pythia-2.8b-deduped-q4_0.bin | 2.0 GiB | 2.0 GiB
ggml-pythia-2.8b-deduped-q5_1.bin | 2.4 GiB | 2.4 GiB
# ALTERNATIVES
If you're here because you want a smaller model to run on a device with constrained memory, consider the following:
- OpenLLaMA [3B](https://huggingface.co/openlm-research/open_llama_3b_350bt_preview) [(7B)](https://huggingface.co/openlm-research/open_llama_7b_400bt_preview)
### This repository contains quantized conversions of EleutherAI's Pythia Deduped checkpoints.
If you're starting off, I highly recommend getting models from the newest directory [(2023-05-25)](https://huggingface.co/Merry/ggml-pythia-deduped/tree/main/2023-05-25).
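If you'd rather grab a file from a script than through the web UI, something like this should do it. It's a minimal sketch assuming you have the `huggingface_hub` package installed; the filename is just one example from the RAM table below, so swap in whichever size and quantization you actually want.

```python
# Minimal sketch: download one ggmlv3 file from the 2023-05-25 directory.
# Assumes `pip install huggingface_hub`; the filename below is only an
# example from the table -- change it to the model/quantization you want.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="Merry/ggml-pythia-deduped",
    filename="2023-05-25/ggmlv3-pythia-410m-deduped-q4_0.bin",
)
print(local_path)  # path of the cached .bin, ready to point a frontend at
```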
# RAM USAGE
Model | RAM usage
:--:|:--:
Unloaded | 41.3 MiB
ggmlv3-pythia-70m-deduped-q4_0.bin | 95.5 MiB
ggmlv3-pythia-160m-deduped-q4_0.bin | 201.1 MiB
ggmlv3-pythia-410m-deduped-q4_0.bin | 415.1 MiB
ggmlv3-pythia-1b-deduped-q4_0.bin | 762.2 MiB
ggmlv3-pythia-1.4b-deduped-q4_0.bin | 1.0 GiB
ggmlv3-pythia-2.8b-deduped-q4_0.bin | 1.9 GiB
ggmlv3-pythia-70m-deduped-q5_1.bin | 108.7 MiB
ggmlv3-pythia-160m-deduped-q5_1.bin | 226.9 MiB
ggmlv3-pythia-410m-deduped-q5_1.bin | 494.0 MiB
ggmlv3-pythia-1b-deduped-q5_1.bin | 943.9 MiB
ggmlv3-pythia-1.4b-deduped-q5_1.bin | 1.3 GiB
ggmlv3-pythia-2.8b-deduped-q5_1.bin | 2.3 GiB
*Tested on KoboldCpp with OpenBLAS enabled.*
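If you want a rough idea of whether a given quantization will fit before downloading, you can ballpark it from the parameter count and the bits-per-weight of the format. This is only a back-of-the-envelope sketch using the standard ggml block layouts (q4_0: 18 bytes per 32 weights, q5_1: 24 bytes per 32 weights); it ignores the context/KV cache and runtime buffers, which is why the measured numbers above come out somewhat higher.

```python
# Back-of-the-envelope RAM estimate for a quantized ggml model.
# Standard ggml block layouts: q4_0 packs 32 weights into 18 bytes
# (4.5 bits/weight), q5_1 packs 32 weights into 24 bytes (6 bits/weight).
# Ignores KV cache and runtime overhead, so real usage is somewhat higher.
BITS_PER_WEIGHT = {"q4_0": 18 * 8 / 32, "q5_1": 24 * 8 / 32}

def estimate_gib(n_params: float, quant: str) -> float:
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1024**3

# Pythia 2.8B deduped as an example:
print(f"q4_0: {estimate_gib(2.8e9, 'q4_0'):.2f} GiB")  # ~1.47 GiB vs. 1.9 GiB measured
print(f"q5_1: {estimate_gib(2.8e9, 'q5_1'):.2f} GiB")  # ~1.96 GiB vs. 2.3 GiB measured
```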
**Notes:**
- The models have been converted with ggerganov/ggml's gpt-neox conversion script, and tested only on KoboldCpp. Other frontends that support GGML-based conversions of GPT-NeoX *should* work, but I can't promise anything (see the loading sketch after these notes).
- They're sorted into directories by the date they were made, so it's easier to track breaking changes. If you're just starting off, I highly recommend the latest (2023-05-25). Combined with KoboldCpp v1.25.1+, the ggmlv3 conversions have improved tokenizer handling, which in my testing reduces occurrences of broken words like "Alicae" or "Reimu Hai-ku-rei".
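For the loading sketch mentioned above: this is roughly what trying one of these files in another frontend could look like. It's an unverified sketch that assumes the third-party `ctransformers` library and its `gpt_neox` model type accept these ggmlv3 files; treat it as a starting point, not a tested recipe.

```python
# Unverified sketch: loading a ggml GPT-NeoX conversion outside KoboldCpp.
# Assumes `pip install ctransformers` and that its gpt_neox backend accepts
# these ggmlv3 files; adjust the path to wherever you saved the model.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "ggmlv3-pythia-410m-deduped-q4_0.bin",
    model_type="gpt_neox",
)
print(llm("The Pythia suite of language models", max_new_tokens=32))
```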
**Versions:**
**2023-04-20:** *q4_3. Used [commit 05f3079](https://github.com/ggerganov/ggml/tree/05f307971862b83df12fada0c42ee027ba5a82b5/examples/stablelm)*
**2023-05-15:** *New quantization format (ggmlv2). q4_0 and q5_1, up to 2.8B. Used [commit 010203f](https://github.com/ggerganov/ggml/tree/010203f94a85df5c86b773dc5acb698c8e7b1e7b/examples/gpt-neox)*
**2023-05-25:** *New quantization format (ggmlv3). q4_0 and q5_1, up to 2.8B. Used [commit 73ad593](https://github.com/ggerganov/ggml/tree/73ad593cf84f864f0fcfd3a196253575c70d66a2/examples/gpt-neox)*
They're separated by date and commit so it's easier to keep track of any breaking changes.
# ALTERNATIVES
If you're here because you want a smaller model to run on a device with constrained memory, consider the following:
- OpenLLaMA [3B](https://huggingface.co/openlm-research/open_llama_3b_350bt_preview) [(7B)](https://huggingface.co/openlm-research/open_llama_7b_400bt_preview)