Heads up: Broken Mistral Nemo 12B quants / other sizes.
I was doing some testing of new Mistral Nemo "MOEs" today and ran into an "oh crap..." situation.
Here it is:
- Min viable quant for a Mistral Nemo (regardless of parameter count) seems to be IQ4_XS / Q4_K_S.
- Imatrix: IQ3_XXS minimum (IQ3_S is better).
- Q2_K (imatrix or not) is borked. Completely, totally BORKED.
- Q3s barely operate / are dumb.
- Suggest Q5 minimum for quality, Q6 preferred.
This applies to 12B Mistral Nemo, MOEs of Mistral Nemo, and larger Mistral Nemos as far as I can tell.
I tested a number of them.
Maybe this is why they dropped max context to 128k for other newer mistrals?
NOTE:
Older Mistrals (7b / 32k context) operate at IQ1_S/M ; same for MOEs of this Mistral type.
That's not good... Interesting how it affects the whole family.
Clarify:
By larger Mistral Nemos I mean stackers like my 23B / 23.5B Grand Gutenbergs.
Going to update my "MN" model cards tomorrow. ERRRR.
:) We're ready, although HF has severely limited our quantize capabilities :(
OOo... storage limits? other?
Early am here, brain not fully engaged.
Going to see if I can "hack" a solution for MNs.
I almost threw 3 perfectly viable MN MOEs in the trash because Q2_K did not work - a case of: "It really is YOU, not ME".
This issue does affect Llama 3 / 3.1 / 3.2 - but to a WAY smaller (possibly negligible) degree.
This tracks with my low-quant investigations => the more context, the higher the quant you need to operate.
Makes sense
-> You need greater BPW, for more nuance, and the model needs more nuanced understanding at higher context limits.
They rate-limited the API because our repository creation was too fast, and blocked us for hours at a time - very little progress was possible. The problem is that every upload tries to create the repository, and that call seems to have a very low rate limit. We have worked around it by uploading small quants in batches, which has improved the situation. But doing hack modifications in a running system with this throughput is not fun, especially if you want to go to sleep and have to wait for a few jobs to finish to see whether it works or not. Sigh.
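For anyone hitting the same wall, the gist of the workaround is: create the repo once up front, then push files in batches, so the uploads never touch the (heavily rate-limited) repo-creation call again. A rough sketch with the huggingface_hub CLI - repo names and paths are placeholders, not the actual pipeline:

```
# create the repository a single time
huggingface-cli repo create Some-Model-GGUF --type model

# then upload quants in small batches instead of one call per file
huggingface-cli upload yourname/Some-Model-GGUF ./batch1 . --include "*.gguf"
huggingface-cli upload yourname/Some-Model-GGUF ./batch2 . --include "*.gguf"
```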
The only real disappointment is that I mailed huggingface about it, and they didn't even bother with a reply.
As for your problem, is it specifically a problem for moes? Would make sense, as quantisation affects smaller models worse.
Hmm ; mini nightmare at this end uploading source files ; really bad the last week or so.
Could be everyone shopping online!?!
Had HF api uploaders (for source) crash/break several times.
I have mine set with 3 duplicates per 10 source files upload - always seemed to work ... not this past week.
Sleep? What is that? ...
Seems to affect Mistral Nemos - MOE or non-moe.
Q2_K quant is junk; been trying different ways to "raise it up" - more BITS for the output weights / embed weights helped a bit.
(--output-tensor-type Q8_0 --token-embedding-type Q8_0 )
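That call, roughly (file names are placeholders; assumes a current llama.cpp build where the quantize binary is named llama-quantize):

```
# keep output + token-embedding tensors at Q8_0 while the rest drops to Q2_K
llama-quantize --output-tensor-type Q8_0 --token-embedding-type Q8_0 \
    MN-model-F16.gguf MN-model-Q2_K.gguf Q2_K
```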
Q3s not much better.
Basically it means the "floor" for a Mistral Nemo (12B model) quant is Q4_K_S as a hard minimum, instead of merely the recommended level.
Ahh... "Q2K" is "freaking desperate, I am dying of thirst".
The issue is both instruction following and output generation.
Instruction: not following correctly / missing instructions.
Output: generation crashes -> paragraph repeats (maybe fixable with the "DRY" sampler - see the sketch below), but worse - word repeats => BOOM.
Sentence structure is poor / awful too.
This is for MOE (2x, 4x) and regular 12B Mistral Nemos.
My MN stackers seem to fare a bit better; 23B, 23.5B, 29B.
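For the paragraph-repeat side, turning on DRY sampling in llama.cpp looks something like this - values are purely illustrative and assume a build recent enough to ship the DRY sampler:

```
# DRY ("don't repeat yourself") sampling to damp paragraph/word repetition
llama-cli -m MN-model-Q2_K.gguf -p "Write a short scene in a lighthouse." \
    --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2
```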
I am still running tests at this point , going to try "older" llamacpp too.
UPDATE:
It seems that generating the "main.gguf" (used to quant all the others) at F32 (vs F16/BF16), with --output-tensor-type Q8_0 --token-embedding-type Q8_0 during quantize, works.
For reference, the default output tensor / token embed types for Q2_K are Q6_K and Q2_K respectively.
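In other words, the pipeline is roughly the following (paths and names are examples only; assumes the stock llama.cpp conversion script):

```
# 1) build the master gguf at F32 instead of F16/BF16
python convert_hf_to_gguf.py ./MN-merge-dir --outtype f32 --outfile MN-master-F32.gguf

# 2) quant from the F32 master, with output/embed bumped to Q8_0
llama-quantize --output-tensor-type Q8_0 --token-embedding-type Q8_0 \
    MN-master-F32.gguf MN-model-Q2_K.gguf Q2_K
```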
Q2k isn't perfect, but it does operate fairly well.
Going to test on other quants and do some measurements.
> Had HF api uploaders (for source) crash/break several times.
Never seen anything else.
> Sleep? What is that? ...
People who don't sleep enough increasingly sound like you. Sufficient sleep is important.
Sleep has never been one of my strong suits.
That being said, I am aware the tin foil hat gets tighter, on little sleep.
More testing today, and working on revised quant options for MN models (and other archs too).
Did some testing on my "Multiverse MOE 4X7B" - q2k ; (7B mistral models, 32k context).
Some uptick in performance. However PPL and general stability of these "old 7Bs" are remarkable.
Looking at crafting "Super 6" and "Super 8" [F16/BF16] on newer archs / archs with higher context levels to measure effects.
Likewise looking at the Imatrix end too; upping the output/embed settings "just a bit" to see if it fixes some issues at the low end.
IE: Instruction following is the first issue that appears at low-end quants, and based on some test results, increasing the bits for embed/output (most likely output!) addresses these issues to a degree.
Interestingly: q8/q8 for output / embed is not always the best choice depending on USE CASE.
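The imatrix side of that experiment would look roughly like this - the calibration file and the exact type bumps are placeholders, just a sketch of the idea:

```
# 1) build an importance matrix from the full-precision master gguf
llama-imatrix -m MN-master-F32.gguf -f calibration.txt -o MN.imatrix

# 2) imatrix quant with the output/embed types nudged upward
llama-quantize --imatrix MN.imatrix \
    --output-tensor-type Q8_0 --token-embedding-type Q8_0 \
    MN-master-F32.gguf MN-model-IQ3_S.gguf IQ3_S
```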
Update:
There seems to be an improvement when "F32"ing the "main.gguf" (used to make all the other quants) with MOEs.
Not clear why yet.
Usually this just pads everything if the source is f16/bf16, unless the source itself is in f32.
Maybe something to do with MOE quanting ?
Going to test a slew of MOEs to see if the F32 issue is MN-related, or across the board.
RE: MN
Did more testing.
Massive improvement in MN (normal, NOT MOEs unless you F32 them) with newest LLAMACPP and re-quanting.
This also extends to all Llamas, Gemmas and Mistrals.
Here is how big:
Models I could not release due to difficulty in "operation" (lots of creativity/prediction breaking, with word repeats, paragraph repeats and other issues) now operate almost perfectly.
This tracks with overall improvements in all quants, magnified at the low end / models that are harder to use.
On a side note:
New quants in terms of PPL actually test somewhat worse, but operate better.
> That being said, I am aware the tin foil hat gets tighter, on little sleep.
I must remember that, that's a splendid way of putting it :)
f32
I don't know what llama.cpp does (and it is a moving target), but it could depend on various factors. For example, f32 could force the imatrix calculations to be done in full precision (and we currently force the use of mmq for that reason as well, so that shouldn't apply to our imatrix, unless I misunderstand how things are implemented). I would be surprised if it affects the quantisation itself though. Often, the fact that f16 cannot represent a lot of numbers that bf16 or f32 can (e.g. it overflows more easily) can make a difference - that's why bf16 is sometimes preferred - it has far worse precision, but much greater range than f16.
How that would figure in, I have no clue. However, you should also always make sure your testing is bias-free, objective and non-random. I've had really bad months with some models, for example, but it's the same model. And I see so many people who swear by certain models, and much later admit it was just a lucky test. I always remember when Alan Cox said "it's amazing what you find out when you actually measure things" (and measuring llms is especially difficult in itself).
I can easily believe that llama.cpp has fixed a lot of imatrix and quantisation-related bugs in recent months, though, and nicboss can, by now, even back it up with objective measurements :) I plan to introduce static iq3 quants again, even...
> Models I could not release due to difficulty in "operation" (lots of creativity/prediction breaking, with word repeats, paragraph repeats and other issues) now operate almost perfectly.
I think most of that might have been some severe regressions inside llama.cpp. But I have also seen the same quant perform dadaism one day, and perform perfectly acceptably the other, with the same or slightly different settings. They are sensitive beasts.
I did a number of tests today on MOEs, comparing F16/BF16 against the merge built at F32 AND the master "gguf" at F32 for quanting.
That is, I tested "moe" merges and ggufs at bf16/f16 against a "moe merge" done at f32 with the gguf master at f32.
There is definitely a difference.
Looks like the "moe layers" benefit from a bit more precision when compressed from F32.
I also did the same with "regular models", IF there were F32 component(s) (RARE!) OR IF they were a merge with "lots of math" (IE DARE-TIES, task arithmetic... basically anything except pass-through).
In all cases there was a jump in creativity, detail, and general performance.
Sometimes slight, sometimes stark.
Ran tests at Temp=0 and at temp using a number of different prompts.
Then I augmented the quants themselves with higher output tensor/embed settings.
IE Q8/Q8; f16, bf16, and f32 (for the f32-quanted models).
I played with a lot of different mixes.
(did this for "moe" models and reg models)
Then did a special one: bf16/bf16 (Q2_K, IQ4_XS, Q6, Q8) -> this forces the output tensor's "math" onto the CPU. CPU math is a bit more accurate => improvement there too.
(Side effect: less VRAM usage, more room for context, but lower T/S speed.)
And of course a "MAX" Q8_0 -> both at f16 (9.5 BPW average), or both at f32 (10.83 BPW average).
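As a rough illustration of what "augmented" means here (type combinations only; file names are placeholders):

```
# standard augmented quant: output/embed bumped to Q8_0
llama-quantize --output-tensor-type Q8_0 --token-embedding-type Q8_0 \
    model-F32.gguf model-IQ4_XS-aug.gguf IQ4_XS

# bf16/bf16 variant (the output-tensor math then runs on the CPU)
llama-quantize --output-tensor-type bf16 --token-embedding-type bf16 \
    model-F32.gguf model-Q6_K-bf16.gguf Q6_K

# "MAX": Q8_0 with output/embed at f16 (or f32 on the F32-mastered models)
llama-quantize --output-tensor-type f16 --token-embedding-type f16 \
    model-F32.gguf model-Q8_0-max.gguf Q8_0
```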
Went a little crazy, and updated/uploaded 4 repos with new improved AND augmented quants.
Adjusted the mixture on all quants too (IE output tensor is set to Q8 minimum across the board), plus special quants added too.
New llamacpp AND augmented quants (f16/bf16) - (notice all the quants!) :
https://huggingface.co/DavidAU/Gemma-The-Writer-N-Restless-Quill-10B-Uncensored-GGUF
https://huggingface.co/DavidAU/L3-Dark-Planet-8B-GGUF
https://huggingface.co/DavidAU/L3-DARKEST-PLANET-16.5B-GGUF
And a full F32 merge (high-precision DARE-TIES) => master gguf at F32 => "reg" quant masters from F32 AND augmented quants:
https://huggingface.co/DavidAU/Gemma-The-Writer-Mighty-Sword-9B-GGUF
(still uploading! haha)