How good is the GGUF?
I generated these to try to run them on my home box (5x 3090/4090) but haven't managed to run them yet. The previous DeepSeek models were fantastic for coding, so I expect these to be better. But given that I'll likely only be able to run a ~3.0 bpw version of the model, I'm not sure it'll be competitive with GPT-4o when quantized that low. I'll report back if I have any success.
I've tried Q2_K on 64GB RAM + 24GB VRAM and ran just a few test prompts to compare with Codestral. DeepSeek Coder V2, even quantized down to Q2_K, was better. Speed was abysmal at under 1 t/s, but it's a 236B model and my GPU shared memory was probably getting thrashed since I was running out of every kind of memory lol. If you have the means to run it, do it, it's great.
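If anyone wants a starting point, a partial-offload run like that looks roughly like this. The model filename, layer count and prompt are just placeholders for my setup; tune -ngl to whatever fits in your VRAM:

```
# Rough sketch of a partial-offload run on a 24GB GPU + 64GB RAM box.
# -ngl sets how many layers go to VRAM; the rest stay in system RAM.
./llama-cli \
  -m DeepSeek-Coder-V2-Instruct-Q2_K.gguf \
  -ngl 12 \
  -c 4096 \
  -p "Write a Python function that merges two sorted lists."
```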
A bit over 3 t/s here with 128GB RAM and 10~14 layers offloaded to the GPUs (Q3_K_L). Theoretically the imatrix quants should be better, but I also read that they'd be slow if not fully offloaded to VRAM; that doesn't seem to be the case with these MoE models (or something else?). (I dreamt it; it's slow.)
Edit: I didn't dream it, but I know nothing. After letting llama-cli generate for a few minutes at around 0.5 t/s, it all of a sudden speeds up to around 4 t/s. It even seems to generate coherent output. This is with iQ4K_M.
IQ4_XS seems to be the sweet spot for me as that works with mlock. I'm curious about the ppl difference between that one and the Q3_K_L.
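For reference, this is roughly how I launch it with mlock. The filename and -ngl value are specific to my box, and --mlock may need a raised memlock limit:

```
# Sketch: pin the IQ4_XS weights in RAM with --mlock so they don't get paged out.
# You may need `ulimit -l unlimited` (or the systemd equivalent) for mlock to succeed.
./llama-cli \
  -m DeepSeek-Coder-V2-Instruct-IQ4_XS.gguf \
  -ngl 14 \
  --mlock \
  -c 4096 \
  -p "Summarize the difference between mmap and mlock."
```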
Ran perplexity, which should apparently be taken with a grain of salt:
Q3_K_L: Final estimate: PPL = 5.2010 +/- 0.03077
IQ4_XS: Final estimate: PPL = 5.0772 +/- 0.03002
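For anyone who wants to reproduce: roughly the command below, with the usual wikitext-2 test split. The model filename is a placeholder, and the absolute numbers depend on the test corpus, so only the relative gap between quants is really meaningful:

```
# Rough recipe for the PPL numbers above (llama.cpp's perplexity tool).
# wiki.test.raw is the standard wikitext-2 test file; any text corpus works,
# but a different file will shift the absolute PPL values.
./llama-perplexity \
  -m DeepSeek-Coder-V2-Instruct-Q3_K_L.gguf \
  -f wiki.test.raw \
  -ngl 12
```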