EXL2 quants of gemma-2-27b-it
My quants are meant to be a tight fit in 24 GB of VRAM. The following VRAM usage numbers assume 8k context.
| bpw | head | VRAM (Q4 cache) | VRAM (FP16 cache) | Notes |
|---|---|---|---|---|
| 5.8 | 8 bit | 21.85 GB | 23.69 GB | fits even with a 16 bit cache, but lower bpw |
| **6.5** | 8 bit | 23.81 GB | 25.65 GB | **my recommendation** |
| 6.6 | 6 bit | 23.86 GB | 25.70 GB | slightly higher bpw, but less precise head |
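If you load one of these with exllamav2's Python API, here is a minimal sketch showing where the 8k context and Q4 cache from the table come in. The model directory is a placeholder, and the exact class names assume a recent exllamav2 release:

```python
# Minimal sketch: load a quant with a Q4 KV cache at 8k context on a 24 GB card.
# The model directory is a placeholder; class names assume a recent exllamav2 release.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer

config = ExLlamaV2Config("gemma-2-27b-it-exl2-6.5bpw")  # local path to the downloaded quant
config.max_seq_len = 8192                               # the 8k context assumed above

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)             # Q4 cache column of the table
model.load_autosplit(cache)                             # load weights, allocating the cache as it goes

tokenizer = ExLlamaV2Tokenizer(config)
```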
For this model, the difference between a 6 bit and an 8 bit head is only ~300 MB, which is not huge; it could be exchanged for about 0.1 bpw in the body.
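A back-of-envelope check of that trade-off, assuming gemma-2-27b's vocabulary size of 256,000 and hidden size of 4608 from its config, and roughly 26B weights in the quantized body:

```python
# Back-of-envelope: 8-bit vs 6-bit head, and what 0.1 bpw in the body is worth.
# vocab_size / hidden_size assumed from gemma-2-27b's config; body size is a rough estimate.
vocab_size, hidden_size = 256_000, 4608
head_params = vocab_size * hidden_size                 # ~1.18B weights in the output head

head_delta_mb = head_params * (8 - 6) / 8 / 1e6        # dropping the head from 8 to 6 bits
body_delta_mb = 26e9 * 0.1 / 8 / 1e6                   # spending 0.1 bpw across ~26B body weights

print(f"6-bit head saves ~{head_delta_mb:.0f} MB")            # ~295 MB
print(f"0.1 bpw in the body costs ~{body_delta_mb:.0f} MB")   # ~325 MB
```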
Check out turboderp's quants & measurement.json:
- 3.00 bits per weight
- 3.50 bits per weight
- 4.00 bits per weight
- 4.50 bits per weight
- 5.00 bits per weight
- 6.00 bits per weight
- 8.00 bits per weight
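To re-quantize at a different bpw or head precision, the measurement.json above can be reused so the slow measurement pass is skipped. A hedged sketch of invoking exllamav2's convert.py from Python; all paths are placeholders and the flag names follow the exllamav2 converter:

```python
# Hedged sketch: re-quantize with exllamav2's convert.py, reusing an existing
# measurement.json so the slow measurement pass is skipped.
# All paths are placeholders; run from the exllamav2 repository root.
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "gemma-2-27b-it",             # unquantized source model (FP16/BF16)
    "-o", "work",                       # scratch directory for intermediate files
    "-cf", "gemma-2-27b-it-6.5bpw-h8",  # output directory for the finished quant
    "-m", "measurement.json",           # turboderp's measurement file
    "-b", "6.5",                        # target body bits per weight
    "-hb", "8",                         # head bits: 6 would save ~300 MB (see above)
], check=True)
```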