Any plan for an 8-bit version?

#1
by jm4n21 - opened

Hi! Thank you very much for sharing. Is there any plan for an 8-bit version? Many thanks :)

ConfidentialMind org

Thanks for taking an interest! @jm4n21

Neural Magic already got there first: neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic
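If you just want to try it, here's a minimal sketch of loading that quant with vLLM's offline Python API (assumes vLLM is installed and the GPU supports FP8, e.g. Hopper/Ada; the prompt and sampling settings are just placeholders):

```python
# Minimal sketch: loading the FP8-dynamic quant via vLLM's offline API.
# Assumes a GPU with FP8 support (e.g. Hopper/Ada).
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic")
params = SamplingParams(temperature=0.7, max_tokens=128)  # placeholder settings

outputs = llm.generate(["Give me a one-line summary of FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```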

Unless you mean int8 (w8a16) - do those run well? AFAIK vLLM's Machete kernel targets Hopper architectures (and handles w8a16), but Hopper doesn't do too well with integer quants.

Beyond the floating-point 8-bit quants:

There are static integer W8A8 quants too: noneUsername/Mistral-Small-24B-Instruct-2501-W8A8

Or a dynamic W8A8 quant (should be more accurate, but slower): EliasOenal/Mistral-Small-24B-Instruct-2501-W8A8-dynamic

But let me know if you meant int8 weights with bf16/f16 activations - I'll make one next weekend.
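For reference, producing a w8a16 quant would look roughly like this with llm-compressor (the scheme string and calibration settings below are assumptions, not a tested recipe):

```python
# Rough sketch: producing an int8-weight / 16-bit-activation (w8a16) quant
# with llm-compressor. Scheme string and calibration settings are assumptions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

oneshot(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    dataset="open_platypus",          # calibration data; placeholder choice
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Mistral-Small-24B-Instruct-2501-W8A16",
)
```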

Hi @JustJaro

Thank you for your detailed suggestions.

The reason I was asking is that FP8-dynamic models are typically loaded via vLLM. However, I have specific reasons for preferring Hugging Face's (Transformers) loading method; hence GPTQ models are probably more suitable for my use case.
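For context, by the Hugging Face loading method I mean something like this (the GPTQ repo name here is hypothetical):

```python
# Sketch of the Transformers loading path I have in mind. Requires a GPTQ
# backend (e.g. optimum + gptqmodel) installed; the model ID is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/Mistral-Small-24B-Instruct-2501-GPTQ"  # hypothetical repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```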

Please let me know if I've misunderstood anything.

Many thanks for your help :)
