Any chance you can start providing quants for 24GB cards?

#1
by RazielAU - opened

Any chance you can start providing IQ2_XS or IQ2_XXS GGUF quants of your 70B models for 24GB card owners? I currently use the quants released by @mradermacher , but it would be awesome to have an official version from the team.

Based on comments from @mradermacher , there is currently a known issue quantising Llama 3.1 models with llama.cpp, something to do with the RoPE base, which should get fixed soon. Once it is fixed, can the team start releasing official quants of the ~70B models for 24GB cards? It's reasonable to expect that people would be interested in running these models on consumer-level hardware.

NeverSleep org
•
edited Jul 27, 2024

His quants are okay. If he does them before me, you can use them, he can be trusted.
I will do them all at once as soon as llama.cpp gets fixed.

I will probably do the GGUFs of the Mistral-based ones (12B/123B) today because those already work, but L3.1 will need to wait.

Thanks for your patience!

Thanks @Undi95 , that would be awesome, would really appreciate if you can do those IQ2 variants for those of us on 24GB cards :).

NeverSleep org

IQ2 variants need an imatrix file. I sadly don't have the time/compute to do it now, so I will upload q4/q5 first and do that after.
If someone does an (imatrix) quant of any of our L3.1 models now, it should work, since the PR got merged: https://github.com/ggerganov/llama.cpp/pull/8676
So it should be okay.

Do standard Q2 variants also need an imatrix file, and how is it generated?

NeverSleep org

After an update, I saw that if you don't provide an imatrix when you do a very low quant, llama.cpp simply refuses to do it because the result would be garbage.
It throws an error message and stops the process.
I will probably do it if nobody else does, as I want to run it at Q4_K_M for myself with the best precision, so I need the imatrix haha
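
For anyone wondering how the imatrix is generated in practice, here is a minimal sketch. It assumes a recent llama.cpp build where the tools are named `llama-imatrix` and `llama-quantize` and are on your PATH. All file names are hypothetical placeholders, and the script only prints the two commands instead of running them.

```python
import shlex


def imatrix_commands(f16_gguf, calibration_txt, out_gguf,
                     quant_type="IQ2_XS", imatrix_path="imatrix.dat"):
    """Build the two llama.cpp command lines as argument lists.

    Step 1 runs the unquantised model over a calibration text file and
    writes the importance matrix. Step 2 quantises using that matrix.
    """
    compute = [
        "llama-imatrix",
        "-m", f16_gguf,          # source model (f16 or similar)
        "-f", calibration_txt,   # plain-text calibration data
        "-o", imatrix_path,      # where the importance matrix is written
    ]
    quantize = [
        "llama-quantize",
        "--imatrix", imatrix_path,
        f16_gguf,                # input model
        out_gguf,                # output model
        quant_type,              # target quant type, e.g. IQ2_XS
    ]
    return compute, quantize


if __name__ == "__main__":
    # Hypothetical file names, replace with your own.
    steps = imatrix_commands("model-f16.gguf", "calibration.txt",
                             "model-IQ2_XS.gguf")
    for cmd in steps:
        print(shlex.join(cmd))
```

The imatrix step is the expensive part because it has to run the full-size model over the calibration text, which is why it tends to get done once and then reused for every quant type.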

Why on earth would you want to run anything in IQ3 or IQ2? At that point, just use a smaller model, you'll waste much less VRAM, processing power, and pain. You'll get a much more coherent / satisfying result in return. Raw billions of parameters only scale with "intelligence" so much already at full weights. But if you turn those weights into near booleans (Q2), you're just making your life more complicated for negative benefits in return.

I run lots of models in IQ3. Smaller models suck, and totally lack coherence, and don't satisfy at all. Also, IQ2 is far from being booleans (almost 3 bits per parameter). There, be amazed by another opinion :)

@RazielAU Q2_K* don't need an imatrix, but they are quite bad without one. Q2_K_S is pretty much useless without an imatrix. Also, the rope issues should be fixed by now, but for some reason, I can't make the model load.

A 70B IQ2_XS fully offloaded on a 24 GB card is significantly better quality than using an 8B, and speed is perfectly fine. If we had newer 30B models they may be better, but none of the good models are doing it.

Plus, almost no one here cares about 30B/4x8B models anymore :(

Okay.. You got me. Fair, it's a choice between fewer than 10 "neurons in the list", not a bool. I was being a tad grandiose. Fine. (On average, I know it's not distributed equally, and some more important layers are treated better. I know my basics :p) But don't tell me that a 70B model in Q2 is worth much of anything for any job or task more complex than a one-round test question. I've tried it. The moment the chat gets even slightly complex, it goes very dumb (especially in XS mode). They can't follow basic orders anymore at that point. I can agree that I might have been overstating it for Q3 tho, but I'll definitely disagree on Q2 :) I blame that whole debate on the lack of 34B models with a modern architecture. They'd be the best compromise for 24GB cards.

I'm curious, did someone actually make something akin to the Open LLM Leaderboard, but just to compare L3 8B and L3 70B 1:1 at all the different quantization "levels"? I know about that graph copy-pasted on every single Q'fied model. But it's not the same as a testing battery.
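
To put rough numbers on the "booleans vs. bits per parameter" point in this exchange, here is a small sketch. The bits-per-weight figures are approximate values commonly quoted for llama.cpp quant types and are assumptions here, not measurements of any specific model. Since the bit budget also pays for block scales, `2 ** bpw` is only an upper bound on how many distinct values an average weight can take.

```python
def avg_levels(bits_per_weight):
    """Upper bound on distinguishable values per weight at a given bit budget."""
    return 2 ** bits_per_weight


# Assumed, approximate bits-per-weight figures (not measured on a real file).
ASSUMED_BPW = {
    "1-bit (true boolean)": 1.0,
    "IQ2_XS": 2.31,
    "IQ3_XS": 3.3,
    "Q4_K_M": 4.85,
}

if __name__ == "__main__":
    for name, bpw in ASSUMED_BPW.items():
        print(f"{name:22s} {bpw:5.2f} bpw  <= {avg_levels(bpw):6.1f} values per weight")
```

Under these assumptions an IQ2_XS weight carries roughly the information of a five-way choice, and a 3-bit budget gives at most eight, which is where the "fewer than 10" figure comes from.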

70B Q2_K imatrix quants work totally fine even for extended roleplaying, chatting, story writing, or jacking off, some models better, some worse. When I was a little kid, I always ran Goliath in Q2_K, I was just that poor, and it was great, and back then we didn't even have those newfangled imatrix things that those young folk talk about all the time, so those were static quants. Get a perspective! :-)

More seriously, I can't agree that they are dumber. My experience is that they make more mistakes at low quants, and, sure, some models suck worse than others, but 8B models always suck in comparison, not just sometimes, but from the very beginning. For creative work, I'd always prefer a 70B IQ3_XS over an 8B f32. Just my non-humble-at-all opinionated opinion of course.

I would greatly appreciate if somebody did more testing/measuring/comparison of different quant levels and different models, though. I see very little of that.

There are kids in other countries that have to go to sleep without any quants at all!

I am with you that I would take a really squished large model over a full precision small model any day of the week.

To be honest, I'd probably still run Goliath as IQ2_XS or so nowadays.

I strongly disagree with your opinions @SerialKicked . In my experience, a quantised IQ2_XS 70B model is significantly better than a 13B model.

My favourite 13B model is Noromaid-13b-v0.3, it's by far the most impressive model NeverSleep has ever done, it's incredibly smart for such a low parameter count, and it's just really good at roleplaying. Better than v0.4, and better than any other comparable size model I've tested. With that said, I don't think it comes close to MiquMaid-v3-70B.i1-IQ2_XS (a 70B NeverSleep model quantised by @mradermacher ). I tried the new Lumimaid-v0.2-12B yesterday, and quickly switched back to MiquMaid-v3-70B.i1-IQ2_XS as it wasn't even close...

The benefits of running a large 70B model far outweigh the loss in precision from the quantisation. If there was a modern 30B model, perhaps it would be a better fit for 24GB cards, but the 30B models that exist today are so close in performance to recently released 13B models that you might as well just stick to the 13B models, which I've found to be inferior to quantised 70B models. And since everyone's either making small 7B-13B models or large 70B+ models with nothing in between, the best option ends up being the quantised 70B models.

As for processing power, this has never been a problem for me. I'm getting over 18 tokens/s on my RTX3090 with MiquMaid-v3-70B.i1-IQ2_XS, which is plenty fast for me. As for your comment about wasting VRAM, I feel your perspective is flawed: if you have 24GB of VRAM and you're only using 8GB, THAT is a waste of VRAM. It makes much more sense to get the most out of the hardware you own. I can comfortably run MiquMaid-v3-70B.i1-IQ2_XS with an 8K context, and it's perfectly stable using 23GB of VRAM. I'm getting everything I can out of the hardware I own, and therefore do not consider it a waste.
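
As a sanity check on the 23GB figure, here is a back-of-the-envelope sketch. Everything in it is an assumption: roughly 2.31 bits per weight for IQ2_XS, a rounded 70B parameter count, a Llama-2-70B-style layout (80 layers, 8 KV heads, head size 128) and an fp16 KV cache. It ignores compute buffers and any tensors kept at higher precision, so a real run will differ somewhat.

```python
GB = 1e9


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantised weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


# All values below are assumptions for illustration, not read from a model file.
weights = weight_bytes(n_params=70e9, bits_per_weight=2.31)
kv = kv_cache_bytes(n_ctx=8192, n_layers=80, n_kv_heads=8, head_dim=128)
total = weights + kv

if __name__ == "__main__":
    print(f"weights : {weights / GB:.1f} GB")
    print(f"KV cache: {kv / GB:.1f} GB")
    print(f"total   : {total / GB:.1f} GB")
```

Under these assumptions the weights come to about 20.2 GB and the 8K KV cache to about 2.7 GB, so the total lands just under 23 GB, which is in the same ballpark as the usage reported above and leaves very little headroom on a 24GB card.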

The only pain has been the wait for models to get quantised to a size (like IQ2_XS) which works on a 24GB card. Aside from that, my experience with quantised 70B models has been very positive.
