Experimental quants of 4-expert Mixtral-style MoE models in various GGUF formats.
Original model used for custom quants: NeverSleep/Mistral-11B-SynthIAirOmniMix
https://huggingface.co/NeverSleep/Mistral-11B-SynthIAirOmniMix
Goal is to have the best-performing MoE under 10 GB.
Experimental q8 and q4 files are included for training/finetuning too.
No sparsity tricks yet.
The 8.4 GB custom 2-bit quant works OK up to a 512-token context, then starts looping.
- Install llama.cpp from GitHub, build it, download the model, and start the server:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
wget https://huggingface.co/nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf
./server -m 4mixq2.gguf --host "my.internal.ip.or.my.cloud.host.name.goes.here.com" -c 512
```
- Limit output to 500 tokens per response; past ~512 tokens the 2-bit quant starts looping.
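A minimal sketch of querying the running `./server` over its `/completion` HTTP endpoint with `n_predict` capped at 500, so generations stay under the point where the 2-bit quant starts looping. The host name is the placeholder from the command above; the port (8080, llama.cpp's default) and the helper names here are assumptions, not part of the repo.

```python
import json
from urllib import request

# Placeholder host from the server command above; port 8080 is llama.cpp's
# default server port (assumption, adjust to your setup).
SERVER = "http://my.internal.ip.or.my.cloud.host.name.goes.here.com:8080"

def build_payload(prompt: str, max_tokens: int = 500) -> dict:
    # Cap n_predict at 500 so output stays below the ~512-token point
    # where the 2-bit quant begins to loop.
    return {"prompt": prompt, "n_predict": max_tokens, "stream": False}

def complete(prompt: str) -> str:
    # POST to llama.cpp server's /completion endpoint and return the text.
    req = request.Request(
        SERVER + "/completion",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Call `complete("Write a haiku about quantization.")` once the server is up; with `-c 512` on the server side, keep prompt plus output inside that window.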