Promising looking results on 24GB VRAM folks!

#3
by ubergarm - opened

Good thread with some MMLU-Pro benchmarks over on r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1fkm5vd/comment/lnxcfzg/?context=3

I still want to find the "knee point" between the 72B and 32B quants...

A summary of Qwen2.5 model and quant performance on the MMLU-Pro Computer Science benchmark, as submitted by redditors over on u/AaronFeng47's great recent post.

| Model Parameters | Quant | File Size (GB) | MMLU-Pro Computer Science | Source |
|---|---|---|---|---|
| 14B | ??? | ??? | 60.49 | Additional_test_758 |
| 32B | 4bit AWQ | 19.33 | 75.12 | russianguy |
| 32B | Q4_K_L-iMatrix | 20.43 | 72.93 | AaronFeng47 |
| 32B | Q4_K_M | 18.50 | 71.46 | AaronFeng47 |
| 32B | Q3_K_M | 14.80 | 72.93 | AaronFeng47 |
| 32B | Q3_K_M | 14.80 | 73.41 | VoidAlchemy |
| 32B | IQ4_XS | 17.70 | 73.17 | soulhacker |
| 72B | IQ3_XXS | 31.85 | 77.07 | VoidAlchemy |
| Gemma2-27B-it | Q8_0 | 29.00 | 58.05 | AaronFeng47 |

References

https://www.reddit.com/r/LocalLLaMA/comments/1fkm5vd/comment/lnxcfzg/
https://www.reddit.com/r/LocalLLaMA/comments/1flfh0p/comment/lo7nppj/

Man, if we could find a way to distribute MMLU-Pro testing... that would be so cool haha.


Q3 better than Q4 ??? Craziness

> Q3 better than Q4 ??? Craziness

Yeah, some interesting results for sure. Though I personally take all these benchmarks with a big grain of salt, then throw some salt over my shoulder for good measure too... lol...

It is just a single benchmark with 410 questions that allows random guesses, so it's hard to extrapolate whether Q3 would actually be better than Q4 on your exact data set, etc.
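To put a rough number on that, here is a quick back-of-the-envelope sketch, assuming the CS subset is the 410 questions mentioned above and treating each answer as an independent trial at the model's true accuracy:

```python
# Rough sanity check: how much sampling noise is in a 410-question accuracy score?
# Assumes each question is an independent Bernoulli trial, which is a
# simplification, but it gives a feel for the error bars.
import math

n = 410   # questions in the MMLU-Pro Computer Science subset
p = 0.73  # a typical 32B quant score from the table above

std_err = math.sqrt(p * (1 - p) / n)
print(f"standard error: {std_err * 100:.1f} points")           # ~2.2 points
print(f"95% interval:  +/- {1.96 * std_err * 100:.1f} points")  # ~4.3 points
```

With roughly +/- 2 points of noise from sampling alone, the Q3 vs Q4 gaps in the table above are well within what you'd expect from chance.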

Fun stuff though, and Ollama-MMLU-Pro makes it easier than ever to try it yourself!
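For anyone who hasn't run it, the harness boils down to an OpenAI-compatible client pointed at a local server. A minimal sketch of that shape (this is not the actual Ollama-MMLU-Pro code; the endpoint URL and model tag are assumptions for a local Ollama setup):

```python
# Minimal sketch of how an MMLU-Pro-style harness queries a local model.
# Not the actual Ollama-MMLU-Pro code; endpoint and model tag are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def ask(question: str, options: list[str]) -> str:
    # Format the multiple-choice question, then ask for a single letter answer.
    prompt = (
        question + "\n"
        + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
        + "\nAnswer with the letter of the correct option."
    )
    resp = client.chat.completions.create(
        model="qwen2.5:32b",  # assumed local model tag
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()
```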

> Man, if we could find a way to distribute MMLU-Pro testing... that would be so cool haha.

Huh, seems like the MMLU-Pro Leaderboard does have a "Submit here" button for self-reported results?! Kind of cool, curious to see if this gets picked up and whether it is automated or what...

> Q3 better than Q4 ??? Craziness

bartowski's own notes on the main page here seem to have different recommendations. This one has an imatrix, whereas the OP in that Reddit post used Ollama models, which come directly from the original Qwen source, so it's possible that things change with that difference (but I bet this imatrix is not multilingual).

@ubergarm yeah that's cool, but it would be even cooler if there was a way to just add your GPU to a pool, similar to distributed training or crypto, and then have the official MMLU-Pro run against the pool and compile the results, so everyone is just running a few iterations each.
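Just to sketch the idea (purely hypothetical, nothing like this exists in the official MMLU-Pro tooling): the questions are scored independently, so a coordinator would only need to deal out index ranges to the pool and merge per-question correctness back.

```python
# Hypothetical sketch of sharding MMLU-Pro questions across a volunteer pool.
# None of these names exist in the real benchmark; this just shows that the
# work partitions trivially because questions are scored independently.
from dataclasses import dataclass

@dataclass
class Shard:
    worker_id: str
    question_ids: list[int]

def make_shards(num_questions: int, workers: list[str]) -> list[Shard]:
    """Deal question indices round-robin across the pool."""
    buckets: dict[str, list[int]] = {w: [] for w in workers}
    for q in range(num_questions):
        buckets[workers[q % len(workers)]].append(q)
    return [Shard(w, qs) for w, qs in buckets.items()]

def merge(results: dict[int, bool]) -> float:
    """Combine per-question correctness flags into a single accuracy score."""
    return 100.0 * sum(results.values()) / len(results)

# e.g. the 410 CS questions spread over three volunteers
for shard in make_shards(410, ["gpu-a", "gpu-b", "gpu-c"]):
    print(shard.worker_id, len(shard.question_ids), "questions")
```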

@robbiemu that's true, but also Q3 can outperform Q4 in benchmarks without being better overall, just by chance or by fortunate pruning.


Hi @bartowski , thanks -- that's cool, so it's just the one "comp sci" measure in MMLU-Pro, but across the whole bench it probably doesn't remain so high.

If I can ask, where do your recommendations on the model page come from?

Purely from bits per weight and a general feeling that has been established over time. Generally speaking, it's been accepted to aim for above 4 bits per weight (Q4_K_M+), but models tend to stay surprisingly coherent all the way down to Q2.
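A rough way to sanity-check that against the file sizes in the table above (ballpark only; the parameter counts here are approximate and some tensors are stored at higher precision than the headline quant):

```python
# Rough bits-per-weight estimates from the GGUF file sizes in the table above.
# Ignores that some tensors (embeddings, output) are kept at higher precision,
# so treat these as ballpark figures only.
GB = 1e9

def bpw(file_size_gb: float, n_params_billion: float) -> float:
    return file_size_gb * GB * 8 / (n_params_billion * 1e9)

print(f"32B Q4_K_M  : {bpw(18.50, 32.8):.2f} bpw")   # ~4.5
print(f"32B Q3_K_M  : {bpw(14.80, 32.8):.2f} bpw")   # ~3.6
print(f"72B IQ3_XXS : {bpw(31.85, 72.7):.2f} bpw")   # ~3.5
```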
