Would also be helpful for gpt4all, since Q4_0, Q4_1, and FP16 are our only options there.
Venketh
venketh
AI & ML interests
None yet
Recent Activity
updated a model about 1 month ago: venketh/Qwen2.5-Coder-14B-Instruct-gguf
updated a model about 1 month ago: venketh/Qwen2.5-Coder-7B-Instruct-gguf
Organizations
None yet
venketh's activity
Upload Mistral-Nemo-Instruct-2407-Q4_0.gguf · 6 · #5 opened 6 months ago by venketh
reacted to bartowski's post with ❤️ 6 months ago
Post · 10074
So it turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp.
It starts out true: imatrix runs the model against a corpus of text and tracks the activations of the weights to determine which are most important.
However, what the quantization then does with that information is where I was wrong.
I think I made an accidental connection between imatrix and ExLlamaV2's measurement pass, where ExLlamaV2 decides how many bits to assign to which weights depending on the target BPW.
Instead, what llama.cpp with imatrix does is attempt to select a scale for each quantization block that most accurately returns the important weights to their original values, i.e. minimizing the dequantization error weighted by the importance of the activations.
The mildly surprising part is that it does a relatively brute-force search: it picks a bunch of candidate scales, tries each one, and keeps whichever yields the minimum error for the weights deemed important in the block.
But yeah, it turns out the quantization scheme is always the same; it's just that the scaling has a bit more logic to it when you use imatrix.
Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up.
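To make that scale search concrete, here is a minimal NumPy sketch of the general idea. It is not llama.cpp's actual code: the symmetric 4-bit range, the candidate-scale grid, and the block size are illustrative assumptions, but it shows how an importance-weighted error can drive the choice of a block scale.

```python
import numpy as np

def best_scale_q4(block, importance, n_candidates=32):
    """Illustrative only: pick a block scale that minimizes the
    importance-weighted dequantization error for a symmetric 4-bit
    quantization (integer values clamped to [-8, 7])."""
    base = np.max(np.abs(block)) / 7.0  # naive scale: fit the largest weight
    best_err, best_scale = np.inf, base
    # brute-force search over candidate scales around the naive one
    for factor in np.linspace(0.8, 1.2, n_candidates):
        scale = base * factor
        if scale == 0:
            continue
        q = np.clip(np.round(block / scale), -8, 7)    # quantize
        deq = q * scale                                # dequantize
        err = np.sum(importance * (block - deq) ** 2)  # importance-weighted error
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale

# Toy usage: one 32-weight block with per-weight importances standing in
# for what an imatrix-style pass would provide.
rng = np.random.default_rng(0)
block = rng.normal(size=32).astype(np.float32)
importance = rng.uniform(0.1, 1.0, size=32).astype(np.float32)
print(best_scale_q4(block, importance))
```

With a uniform importance vector the same search just minimizes plain squared error, which matches the point above: the quantization format itself does not change, only the logic that picks the scale does.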
IQ1_S or IQ_M for low RAM/VRAM computers · 12 · #20 opened 10 months ago by teneriffa
Were all the quantizations produced w/ importance matrices? · 7 · #19 opened 10 months ago by venketh
How did you train / FT this? · #2 opened over 1 year ago by venketh