Any plan for an 8-bit version?

#1
by jm4n21 - opened

Hi! Thank you very much for sharing. Is there any plan for an 8-bit version? Many thanks :)

ConfidentialMind org

Thanks for taking an interest! @jm4n21

Neural Magic already got there first: neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic
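If you just want to try it, here's a minimal sketch of loading that quant with vLLM's offline Python API (assumes vLLM is installed and the GPU supports FP8, e.g. Hopper/Ada; the prompt and sampling settings are just placeholders):

```python
# Minimal sketch: loading the FP8-dynamic quant via vLLM's offline API.
# Assumes a GPU with FP8 support (e.g. Hopper/Ada).
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic")
params = SamplingParams(temperature=0.7, max_tokens=128)  # placeholder settings

outputs = llm.generate(["Give me a one-line summary of FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```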

Unless you mean int8 (w8a16) - do those run well? AFAIK vLLM's Machete kernel targets Hopper architectures (and handles w8a16), but Hopper doesn't do too well with integer quants.

Beyond the floating-point 8-bit quants:

There are static integer W8A8 quants too: noneUsername/Mistral-Small-24B-Instruct-2501-W8A8

Or a dynamic W8A8 quant (should be more accurate, but slower): EliasOenal/Mistral-Small-24B-Instruct-2501-W8A8-dynamic

But let me know if you meant int8 weights with bf16/f16 activations - I'll make one next weekend.
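For reference, producing a w8a16 quant would look roughly like this with llm-compressor (the scheme string and calibration settings below are assumptions, not a tested recipe):

```python
# Rough sketch: producing an int8-weight / 16-bit-activation (w8a16) quant
# with llm-compressor. Scheme string and calibration settings are assumptions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

oneshot(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    dataset="open_platypus",          # calibration data; placeholder choice
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Mistral-Small-24B-Instruct-2501-W8A16",
)
```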

Hi @JustJaro

Thank you for your detailed suggestions.

The reason I was asking is that FP8-dynamic models are typically loaded via vLLM. However, I have specific reasons for preferring Hugging Face's (Transformers) loading method; hence GPTQ models are probably more suitable for my use case.
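For context, by the Hugging Face loading method I mean something like this (the GPTQ repo name here is hypothetical):

```python
# Sketch of the Transformers loading path I have in mind. Requires a GPTQ
# backend (e.g. optimum + gptqmodel) installed; the model ID is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/Mistral-Small-24B-Instruct-2501-GPTQ"  # hypothetical repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```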

Please let me know if I've misunderstood anything.

Many thanks for your help :)
