TQ1 quant version

#7
by TobDeBer - opened

Please add a TQ1 version. It should only be around 400 MB in size.

On a related note: Why are existing bitnet GGUF so large?
1840MB for 2740M parameters is 5.37 bpw! Far from the claimed 1.58 bpw.
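
For reference, the arithmetic behind that figure, using the numbers above:

```python
size_mb = 1840           # reported GGUF file size, MB (10**6 bytes)
params_m = 2740          # reported parameter count, millions

bits = size_mb * 1e6 * 8               # total bits stored in the file
bpw = bits / (params_m * 1e6)          # bits per weight
print(f"{bpw:.2f} bpw")                # -> 5.37 bpw, vs. the nominal 1.58 bpw
```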

Why are existing bitnet GGUF so large?

I think it's because the token embeddings tensor is kept in F16 in the i2_s model, whereas with TQ1_0 and TQ2_0 it is usually stored as Q4_K or Q6_K.
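
You can check this yourself with the `gguf` Python package that ships with llama.cpp; a minimal sketch (the file name is just a placeholder) that lists the largest tensors and their storage types:

```python
from gguf import GGUFReader  # gguf-py, bundled with llama.cpp

reader = GGUFReader("ggml-model-i2_s.gguf")  # placeholder path

# Print the five largest tensors by on-disk size; in the i2_s file
# the F16 token_embd.weight should dominate.
for t in sorted(reader.tensors, key=lambda t: t.n_bytes, reverse=True)[:5]:
    print(f"{t.name:24s} {t.tensor_type.name:8s} {t.n_bytes / 2**20:8.2f} MiB")
```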

Yes, I noticed that while I was converting and testing it, but output.weight is also normally reduced; that happened automatically when converting:

[   1/ 333]                        output.weight - [ 2560, 128256,     1,     1], type =    f16, converting to q6_K .. size =   626.25 MiB ->   256.86 MiB
[   2/ 333]                    token_embd.weight - [ 2560, 128256,     1,     1], type =    f16, converting to iq4_nl .. size =   626.25 MiB ->   176.13 MiB
[...]
llama_model_quantize_internal: model size  =  1751.06 MB
llama_model_quantize_internal: quant size  =   934.16 MB
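
The per-tensor numbers in that log line up with the block layouts I believe llama.cpp uses for these types; a quick sanity check:

```python
MiB = 1024 * 1024
n = 2560 * 128256                    # elements in token_embd.weight / output.weight

f16    = n * 2                       # F16: 2 bytes per weight
q6_k   = n // 256 * 210              # Q6_K: 210-byte super-block of 256 weights (6.5625 bpw)
iq4_nl = n // 32 * 18                # IQ4_NL: 18-byte block of 32 weights (4.5 bpw)

print(f"f16    {f16 / MiB:7.2f} MiB")     # 626.25 MiB
print(f"q6_K   {q6_k / MiB:7.2f} MiB")    # 256.86 MiB
print(f"iq4_nl {iq4_nl / MiB:7.2f} MiB")  # 176.13 MiB
```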

Edit: It seems these two tensors (output.weight and token_embd.weight) are identical.
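
(Sketch of how one could verify that on the unquantized F16 GGUF, again with the `gguf` package and a placeholder file name:)

```python
import numpy as np
from gguf import GGUFReader

reader = GGUFReader("ggml-model-f16.gguf")   # placeholder: the F16 source model
tensors = {t.name: t for t in reader.tensors}

# Both tensors are stored as F16 here, so a direct element-wise comparison is meaningful.
emb = tensors["token_embd.weight"].data
out = tensors["output.weight"].data
print(np.array_equal(emb, out))
```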

I uploaded versions for ik_llama.cpp here: https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF that are smaller due to the changes above.
