Where GGUF?
Yeah, it would be great if this could be applied to GGUF or EXL2 quantisation; GPTQ isn't very widely used anymore.
I hadn't used GPTQ for like a year. When I tried this in the latest ooba, it produced garbage characters lol
I am looking for the instructions and will give it a try.
If anyone knows how to convert GPTQ models to GGUF or EXL2, please help me out. Thank you!
Hmm, I wonder if there's a more universal format to transfer to? I've been trying to figure out how to run this with Silicon/Metal/MPS. The closest thing I've found thus far is Mistral.rs -- the dev is lightning quick with updates, it seems, and recently added some GPTQ-type support, FYI. IIRC, one should be able to convert the GPTQ model to GGUF/GGML (a rough sketch of that route is at the end of this comment), but if not, one can definitely run a GPTQ-quantized model on Mistral.rs -- 2-bit, odd-bit, no problem. Plus, it's Rust, which in itself means some boosts to performance/reliability!
Sadly, that support doesn't translate over to Metal just yet.
Metal/MPS can run with a Triton kernel now, so that's another possibility if you really wanna open up your wonderful, awesome-sauce quants to everyone (just spitballing -- I was only skimming all this as of a week ago, but I believe that's a valid avenue as far as GPTQ compatibility goes).
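On the GGUF route mentioned above: as far as I know there's no direct GPTQ-to-GGUF converter, so the usual path is to start from the original fp16 HF checkpoint and re-quantize on the llama.cpp side. A minimal sketch, assuming a local llama.cpp checkout (script and binary names follow recent llama.cpp and may differ in older versions; all paths are placeholders):

```python
# Rough sketch: fp16 HF checkpoint -> fp16 GGUF -> 2-bit GGUF via llama.cpp.
# Assumes llama.cpp's convert_hf_to_gguf.py and llama-quantize are available
# in the working directory; adjust paths/names for your checkout.
import subprocess

fp16_dir = "/path/to/fp16-model"   # placeholder: the original unquantized HF model
f16_gguf = "model-f16.gguf"
q2_gguf = "model-Q2_K.gguf"

# 1) Convert the HF checkpoint to an fp16 GGUF file.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", fp16_dir, "--outfile", f16_gguf],
    check=True,
)

# 2) Re-quantize with llama.cpp (Q2_K chosen here as a 2-bit example).
subprocess.run(["./llama-quantize", f16_gguf, q2_gguf, "Q2_K"], check=True)
```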
@BuildBackBuehler
Thanks for your interest.
Recently, T-MAC from Microsoft added support for running EfficientQAT-quantized models. Additionally, the reported speed is even faster than llama.cpp's.
Hello @ChenMnZ,
Can we run this quantized model with T-MAC or Mistral.rs? Have you tried them with this 2-bit model? Thanks!
@MLDataScientist The owner of T-MAC has tried this. You can refer to https://github.com/OpenGVLab/EfficientQAT/issues/3#issuecomment-2298608707 for details.