Oh, you were the guy!
So you did apparently use LoRA weights in int4 quantization?
Yes, this uses LoRA, as mentioned in the README. I am currently testing LoRA on 30B to see if there is an improvement; I will upload an int4 version if there is.
I have not done quantization before, but to my understanding there is a script which can probably be adjusted to work with any model that works with Hugging Face Transformers. I think Galactica 30B would also be very handy to have. Do you think it is possible to quantize? Does quantization happen in RAM or in VRAM?
Thinking the same about nllb-moe-54b
Both RAM and VRAM. Approx. 90 GB RAM at peak and 10 GB VRAM (for 30B), but you can also use swap. If you link me those models I can look into them, but if there's no GPTQ conversion script I'd have to create my own.
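For anyone wondering what such a conversion looks like in practice, here is a minimal sketch. It assumes the auto-gptq library rather than the exact script used here, and assumes it recognizes Galactica's OPT-style layers; the output folder name and calibration text are placeholders.

```python
# Rough sketch of int4 GPTQ quantization of a Hugging Face model via auto-gptq.
# Assumptions: auto-gptq supports this architecture; "galactica-30b-4bit-128g"
# and the calibration sample are hypothetical placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/galactica-30b"
out_dir = "galactica-30b-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# A single calibration sample for illustration; a real run uses a proper
# calibration set (e.g. a few hundred text chunks).
examples = [tokenizer("The Transformer architecture relies on self-attention.")]

quantize_config = BaseQuantizeConfig(
    bits=4,         # int4 weights
    group_size=128,
    desc_act=False,
)

# The full-precision weights sit in system RAM and layers are quantized on the
# GPU one at a time, which is why peak RAM is far larger than the VRAM needed.
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized(out_dir)
```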
Good to know, thanks. So that's easily possible on a beefy workstation.
There is no script yet for them. As far as I understand, the existing script can be adjusted for Hugging Face Transformers models.
The links are:
https://huggingface.co/facebook/nllb-moe-54b
https://huggingface.co/facebook/galactica-30b
The former is huge. Galactica would be very handy to fit in VRAM, which currently isn't possible for a lot of setups (rough loading sketch below).
NLLB is more like an asset: currently state of the art for translation work, but unusably huge.
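For reference, once an int4 GPTQ checkpoint exists, it should fit on a single GPU: 30B parameters at 4 bits is roughly 15 GB of weights before overhead. A sketch of loading it, again assuming auto-gptq and a hypothetical local folder name:

```python
# Sketch of loading an int4 GPTQ checkpoint onto one GPU. Assumes auto-gptq;
# "galactica-30b-4bit-128g" is a hypothetical folder from the quantization step.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_dir = "galactica-30b-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")

prompt = "The Schwarzschild radius of a black hole is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```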
If you modify the conversion script to support them (individually), then I can run it on my server. I don't really have the time to dig into the scripts and change them around right now.