Is the Q8_0 quant also imatrix'd? Why?

by igzbar - opened May 2, 2024

May 2, 2024

What was the basis of the decision to use imatrix vs. regular quantization for Q8_0? Doesn't imatrix reduce performance?

bartowski

Owner May 2, 2024

•

edited Nov 3, 2024

It shouldn't reduce performance (unless you have a source on that), but it also shouldn't affect it much, if at all, since at Q8 there's no need to compress some portions further than others.

Edit: the above was based on old knowledge. If you come across this, the real answer is that Q8 quants completely disable the imatrix, as you can see here:

https://github.com/ggerganov/llama.cpp/blob/8b3befc0e2ed8fb18b903735831496b8b0c80949/ggml/src/ggml-quants.c#L3303

mradermacher

Nov 3, 2024

Was just pointed at this discussion - imatrix data is not used for Q8_0 quants, so the resulting quant will be essentially the same regardless of whether an imatrix is specified or not (the only thing that changes is the header values saying which imatrix file is used etc., not the actual model data). Q6_K is the highest-bpw tensor format that uses the imatrix data for quantisation.
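To make the point concrete, here is a minimal sketch of Q8_0-style block quantization in plain Python. It is not the llama.cpp code, just an illustration under a few assumptions: blocks of 32 values, one scale per block equal to the largest absolute value divided by 127, and round-to-nearest. The scale is kept as a Python float rather than stored as fp16, and the `importance` parameter is a hypothetical stand-in for imatrix data, accepted only to show that nothing in the procedure has a use for it.

```python
import math

QK8_0 = 32  # number of weights per Q8_0 block


def quantize_block_q8_0(block, importance=None):
    """Quantize one block of 32 floats to (scale, list of int8-range values).

    `importance` is a hypothetical imatrix-style argument. It is deliberately
    unused: the scale depends only on the largest absolute value in the block.
    """
    assert len(block) == QK8_0
    amax = max(abs(x) for x in block)
    d = amax / 127.0                      # per-block scale
    inv_d = 1.0 / d if d else 0.0
    # Round half away from zero (Python's built-in round() rounds half to even).
    qs = [int(math.copysign(math.floor(abs(x * inv_d) + 0.5), x)) for x in block]
    return d, qs


def dequantize_block_q8_0(d, qs):
    """Reconstruct approximate floats from a scale and quantized values."""
    return [d * q for q in qs]


if __name__ == "__main__":
    weights = [0.01 * (i - 16) for i in range(QK8_0)]
    fake_imatrix = [float(i + 1) for i in range(QK8_0)]  # made-up importance values
    plain = quantize_block_q8_0(weights)
    with_imatrix = quantize_block_q8_0(weights, importance=fake_imatrix)
    print("identical output:", plain == with_imatrix)
```

With only a scale and plain rounding there is no free parameter left for importance weights to steer, which is consistent with the quantized data being the same with or without an imatrix.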

bartowski

Owner Nov 3, 2024

•

edited Nov 3, 2024

Yeah, this is a very old comment of mine. I've since learned that for Q8 the code itself explicitly disables the imatrix before doing the quantization.

Updated the comment in case others come across it.
