TQ1 quant version
Please add a TQ1 version. This should be just around 400MB in size.
On a related note: Why are existing bitnet GGUF so large?
1840MB for 2740M parameters is 5.37 bpw! Far from the claimed 1.58 bpw.
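For a quick sanity check, here is the arithmetic behind that figure (assuming 1 MB = 10^6 bytes; Python just for illustration):

```python
# Bits-per-weight for a 1840 MB file holding 2740M parameters
# (assuming 1 MB = 10^6 bytes).
file_size_mb = 1840
n_params = 2_740_000_000

bpw = file_size_mb * 1e6 * 8 / n_params
print(f"{bpw:.2f} bpw")  # ~5.37
```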
Why are existing bitnet GGUF so large?

I think it's because the token embeddings tensor is kept in F16 in the i2_s model, while with TQ1_0 and TQ2_0 it's usually stored as either Q4_K or Q6_K.
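To put rough numbers on that: the token embeddings tensor in this model is 2560 x 128256 (see the conversion log below), so its on-disk size at the different storage types works out as follows. This is a back-of-the-envelope sketch using the nominal bits-per-weight of each llama.cpp format, not the output of any tool:

```python
# Rough size of the 2560 x 128256 token-embedding tensor at different
# storage types (nominal bits-per-weight for F16, Q6_K and IQ4_NL in llama.cpp).
n_elems = 2560 * 128256

for name, bpw in [("f16", 16.0), ("q6_K", 6.5625), ("iq4_nl", 4.5)]:
    mib = n_elems * bpw / 8 / 2**20
    print(f"{name:>7}: {mib:7.2f} MiB")
# f16: 626.25 MiB, q6_K: 256.86 MiB, iq4_nl: 176.13 MiB
```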
Yes, I noticed that while converting it for testing, but the output.weight tensor also normally gets reduced; that happened automatically when converting:
[ 1/ 333] output.weight - [ 2560, 128256, 1, 1], type = f16, converting to q6_K .. size = 626.25 MiB -> 256.86 MiB
[ 2/ 333] token_embd.weight - [ 2560, 128256, 1, 1], type = f16, converting to iq4_nl .. size = 626.25 MiB -> 176.13 MiB
[...]
llama_model_quantize_internal: model size = 1751.06 MB
llama_model_quantize_internal: quant size = 934.16 MB
Edit: It seems these two tensors (token_embd.weight and output.weight) are identical.
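If anyone wants to verify that, one way is to compare the raw tensor data with the gguf-py package that ships with llama.cpp; this is just a sketch, and the file name is a placeholder:

```python
# Sketch: check whether token_embd.weight and output.weight hold the same
# data in an F16 GGUF, using gguf-py from llama.cpp. File name is a placeholder.
import numpy as np
from gguf import GGUFReader

reader = GGUFReader("bitnet-b1.58-2B-4T-f16.gguf")
tensors = {t.name: t for t in reader.tensors}

a = tensors["token_embd.weight"].data
b = tensors["output.weight"].data
print(np.array_equal(a, b))  # True if the two tensors are identical
```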
I uploaded versions for ik_llama.cpp here: https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF that are smaller due to the changes above.