Add imatrix computing tips
#746
by treehugg3 - opened

README.md CHANGED
@@ -142,6 +142,35 @@ and then run another command which handles download/computation/upload. Most of
to do stuff when things go wrong (which, with llama.cpp being so buggy and hard to use,
is unfortunately very frequent).

## What do I need to do to compute imatrix files for large models?

Use [`llama-imatrix`](https://github.com/ggml-org/llama.cpp/blob/master/examples/imatrix/README.md) to compute imatrix files.
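
For illustration, a minimal invocation might look like the sketch below. The file names and the `-ngl` value are
placeholders, and flags can differ between llama.cpp versions, so check `llama-imatrix --help` for your build:

```sh
# Sketch only: file names and flag values are placeholders, not our exact pipeline.
# -m:   the (possibly Q8_0-quantized) GGUF to run the calibration pass over
# -f:   plain-text calibration dataset
# -o:   where to write the resulting imatrix file
# -ngl: offload as many layers as fit into the GPU; the rest stays in system RAM
./llama-imatrix \
    -m Model-405B.Q8_0.gguf \
    -f imatrix-training-data.txt \
    -o Model-405B.imatrix \
    -ngl 10
```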

### Hardware

* RAM: A lot of RAM is required to compute imatrix files. For example, 512 GB is just enough to compute the imatrix for a 405B
  model in Q8.
* GPU: At least 8 GB of VRAM.
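
To see why 512 GB is "just enough": Q8_0 packs 32 int8 weights plus one fp16 scale per block, i.e. roughly 8.5 bits per
weight, so the 405B weights alone take around 430 GB before compute buffers and the OS. A quick back-of-envelope check
(approximate, not a measured number):

```sh
# Back-of-envelope only: ~8.5 bits per weight for Q8_0, 405e9 weights.
echo "405 * 10^9 * 8.5 / 8 / 10^9" | bc -l   # ~430 (GB), leaving little headroom in 512 GB
```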

### Dataset

* You want to create a dataset that is around double the size of bartowski1182's imatrix dataset. Quality is far more important
  than size. If you don't mind long training times you can make it massive, but beyond roughly 1 MB you will probably see
  diminishing returns.
* Your imatrix dataset should contain the typical output the model would generate for the workload you plan to use the model for.
  If you plan on using the model as a programming assistant, your imatrix dataset should contain the typical code you would ask it
  to write. The same applies to language: our dataset is mostly English, so if you use our imatrix models in a different language
  they will likely perform worse than static quants, as only a very small portion of our imatrix training data is multilingual.
  We only have the resources to generate a single generic imatrix quant per model, so our imatrix dataset must contain examples of
  every common use case of an LLM (see the sketch after this list).
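
A sketch of how such a mixed dataset could be assembled; all file names below are made up, the point is just to mix the
use cases you care about and keep an eye on the total size:

```sh
# Sketch only: the input files are placeholders for whatever covers your use cases
# (code, chat, prose, a little multilingual text, ...).
cat code_samples.txt chat_samples.txt prose_samples.txt multilingual_samples.txt \
    > imatrix-training-data.txt

# Quality beats size; past roughly 1 MB the returns likely diminish.
du -h imatrix-training-data.txt
wc -c imatrix-training-data.txt
```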

### Extra tips

* Computing the 405B imatrix on a Q8 quant does not seem to have any noticeable quality impact compared to computing it on BF16,
  so to save on hardware requirements, use Q8.
* Sometimes a single node does not have enough RAM to compute the imatrix file. In such cases, `llama-rpc` inside llama.cpp can
  be used to combine the RAM/VRAM of multiple nodes (see the sketch after this list). This approach takes longer: computing the
  405B imatrix file in BF16 takes around 20 hours using 3 nodes with 512 GB, 256 GB, and 128 GB of RAM, compared to 4 hours for
  Q8 on a single node.
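
Roughly, the multi-node setup looks like the sketch below. Host names and ports are made up, the worker binary name and flag
spellings are my recollection of llama.cpp's RPC example (built with RPC support enabled), so verify them against your build's
`--help` output:

```sh
# Sketch only: hosts, ports and flag spellings should be checked against your build.

# On each extra node, start an RPC worker that exposes its RAM/VRAM:
./rpc-server -H 0.0.0.0 -p 50052

# On the main node, point llama-imatrix at the workers; their memory is pooled
# with the local node's for the computation:
./llama-imatrix \
    -m Model-405B.BF16.gguf \
    -f imatrix-training-data.txt \
    -o Model-405B.imatrix \
    --rpc node2:50052,node3:50052
```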

## Why don't you use gguf-split?

TL;DR: I don't have the hardware/resources for that.