Hi, I made a GPTQ quant.
https://huggingface.co/Kotokin/sophosympatheia_Midnight-Miqu-70B-v1.0_GPTQ32G
The group size is 32. With a GPU split of 20.5 and 23, that is enough for a 32k context with an 8-bit cache.
I'll add mine here too. EXL2 3.75 bpw: https://huggingface.co/altomek/Midnight-Miqu-70B-v1.0-3.75bpw-EXL2 (uploading!)
Nice! Thanks, you two. I added links to the model card.
EDIT: By the way, I was able to hit 32K context with a 4.0 bpw EXL2 quant this morning without using 8-bit cache and I had VRAM to spare. 23.1/24 GB on the first card and 22.4/24 GB on the second card.
Wow, nice. Was that after a fresh system restart? I can load 4 bpw in exllama, but only after a reboot and with a much shorter context. I only have 40 GB of VRAM :(
export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync helps you fit a tad more. I try for more bits for the buck even at the expense of context. When you chat up to 32k it gets slow anyway.
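For anyone launching from a Python script instead of a shell, here is a minimal sketch of the same setting. It assumes the variable has to be in place before PyTorch makes its first CUDA allocation, so it is set before any `import torch`. The helper `allocator_backend` is a made-up name for illustration, not part of PyTorch.

```python
import os

# Same effect as the shell export above. Set it before importing torch so the
# CUDA allocator sees it when it initializes.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"


def allocator_backend() -> str:
    """Return the backend named in PYTORCH_CUDA_ALLOC_CONF, or "native" if none is set.

    Hypothetical helper: it only parses the environment variable, it does not
    ask PyTorch which allocator is actually in use.
    """
    conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    for option in conf.split(","):
        key, _, value = option.partition(":")
        if key.strip() == "backend":
            return value.strip()
    return "native"


if __name__ == "__main__":
    print("Requested allocator backend:", allocator_backend())
```

Import torch (or start your loader) only after this has run, otherwise the setting may be ignored.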
+1 to what @jackboot said. The backend:cudaMallocAsync setting does wonders, and I agree with the sentiment of balancing max context with bits per weight. Context is great and all, but if the model is too dumbed down to do anything useful with it, then I think you've overshot the mark.
That being said, Midnight Miqu at 4.0 bpw still performed well.
Here's what I know so far in terms of what I can fit in my 48 GB of VRAM (2 x NVIDIA 3090s, no SLI, always with a little room to spare):
- exl2-5.0bpw -- 10K context, 18K with 8-bit cache
- exl2-4.65bpw -- 20K context, 32K with 8-bit cache
- exl2-4.0bpw -- 32K context easily, 64K with 8-bit cache with room to spare. Holy cow, it works at 64K with alpha_rope 2.5 if you don't mind 0.23 tokens/s and turning your rig into a heater for your house while you're running inference.
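A rough back-of-the-envelope sketch of why the numbers above trade off the way they do. The model shape here is an assumption (a Llama-2-70B-style layout: 80 layers, 8 KV heads, head dimension 128, roughly 69B parameters), not something stated in this thread, and the estimate ignores activation buffers and framework overhead, so real usage will be higher.

```python
# Assumed Llama-2-70B-style shape; these values are NOT taken from the thread.
N_LAYERS = 80
N_KV_HEADS = 8        # grouped-query attention
HEAD_DIM = 128
N_PARAMS = 69e9       # approximate parameter count

GIB = 1024 ** 3


def kv_cache_gib(context_tokens: int, bytes_per_value: float) -> float:
    """Estimate KV cache size in GiB: keys + values for every layer and KV head."""
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return context_tokens * bytes_per_token / GIB


def weights_gib(bits_per_weight: float) -> float:
    """Estimate quantized weight size in GiB from the average bits per weight."""
    return N_PARAMS * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    for bpw, ctx in [(5.0, 10_000), (4.65, 20_000), (4.0, 32_768)]:
        w = weights_gib(bpw)
        fp16 = kv_cache_gib(ctx, 2)   # 16-bit cache: 2 bytes per value
        q8 = kv_cache_gib(ctx, 1)     # 8-bit cache: 1 byte per value
        print(f"{bpw} bpw @ {ctx} tokens: weights {w:.1f} GiB, "
              f"cache {fp16:.1f} GiB (16-bit) or {q8:.1f} GiB (8-bit)")
```

Under these assumptions the 8-bit cache halves the per-token cost, which is consistent with context roughly doubling at each bpw level in the list.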
GPTQ just finished.. definitely free-er than other miqus. ChatML, Vicuna, Mistral formats all work. Not sure which one is the best.
Haha free-er is a good way to put it. Enjoy.
Haha, I also prefer to run larger quants rather than have longer context. You can see my attempts to fit Midnight-Rose in 40 GB of VRAM. I made 3.75, 3.80, 3.85 and 3.9 bpw quants of it just to choose the one that uses the most VRAM without causing GPU memory overflow.
BTW, @sophosympatheia, big thanks for that model! It is really great for everyday tasks and some writing! In my usage, Midnight-Rose is exceptionally good at text summarization. It makes correct and detailed summaries of articles. One of the best models for this task. Midnight-Miqu also looks promising, but I still need more testing to see how it works for me.
I haven't used the backend:cudaMallocAsync setting yet; I'll see what wonders it can do for me. Thank you!
Passes the watermelon test (it puts them down after 2 or 3) but fails the JavaScript test. It does try to incorporate the code mid-roleplay into the story, however. Maybe a larger quant will pass. On your 5-bit quant, try asking it to give you a "hello world" in JS for a character that doesn't understand coding.
@jackboot
If I'm understanding your javascript test correctly, a character who doesn't understand coding shouldn't respond with the code, right? The test is whether the model is smart enough to attend to that detail or whether it breaks character to be helpful. That's a subtle and challenging test. I tried it at 5.0 bpw with a character who doesn't canonically know coding and the model had her produce the code consistently despite several rerolls of the answer. As you noted, it was at least creative about weaving that into the story and staying in character with the delivery.
I tried my 3.3 bpw quant of the 103B version on the same test and it wasn't any better until I added an explicit comment that the character does not know coding.
What a devious test. I like it.
Yes, many, many models fail it. That "mixtral" which was two 34Bs slammed together, and aetheria, both pass it but have other issues.
BTW, here's the difference between 5bpw and GPTQ:
The perplexity for Midnight-Miqu-70B-v1.0_exl2_5.0bpw is: 23.123085021972656
The perplexity for Midnight-Miqu-70B-v1.0_GPTQ32G is: 23.966127395629883
Using PTB_NEW
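For context on what those two numbers mean, here is a minimal sketch. It assumes the usual definition of perplexity (the exponential of the mean negative log-likelihood per token), which may differ in detail from how the evaluation tool used above windows the text. `perplexity` and `relative_gap` are illustrative helper names, and the two constants are the values reported above.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_gap(baseline_ppl: float, other_ppl: float) -> float:
    """Fractional increase of other_ppl over baseline_ppl."""
    return (other_ppl - baseline_ppl) / baseline_ppl


# Values reported in this thread (PTB_NEW).
EXL2_5BPW_PPL = 23.123085021972656
GPTQ_32G_PPL = 23.966127395629883

if __name__ == "__main__":
    gap = relative_gap(EXL2_5BPW_PPL, GPTQ_32G_PPL)
    print(f"GPTQ32G perplexity is {gap:.1%} higher than exl2 5.0bpw")
```

So the GPTQ quant comes out roughly 3.6% higher in perplexity on this dataset. Lower is better, and scores are only comparable when measured on the same dataset with the same settings.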