TQ1 quant version

#7
by TobDeBer - opened

Please add a TQ1 version. It should only be around 400 MB in size.

On a related note: Why are existing bitnet GGUF so large?
1840MB for 2740M parameters is 5.37 bpw! Far from the claimed 1.58 bpw.
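
For reference, the arithmetic behind that figure, using the numbers above:

```python
size_mb = 1840           # reported GGUF file size, MB (10**6 bytes)
params_m = 2740          # reported parameter count, millions

bits = size_mb * 1e6 * 8               # total bits stored in the file
bpw = bits / (params_m * 1e6)          # bits per weight
print(f"{bpw:.2f} bpw")                # -> 5.37 bpw, vs. the nominal 1.58 bpw
```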

Why are existing bitnet GGUF so large?

I think it's because the token embeddings tensor is kept in F16 in the i2_s model, whereas with TQ1_0 and TQ2_0 it is usually stored as Q4_K or Q6_K.
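
You can check this yourself with the `gguf` Python package that ships with llama.cpp; a minimal sketch (the file name is just a placeholder) that lists the largest tensors and their storage types:

```python
from gguf import GGUFReader  # gguf-py, bundled with llama.cpp

reader = GGUFReader("ggml-model-i2_s.gguf")  # placeholder path

# Print the five largest tensors by on-disk size; in the i2_s file
# the F16 token_embd.weight should dominate.
for t in sorted(reader.tensors, key=lambda t: t.n_bytes, reverse=True)[:5]:
    print(f"{t.name:24s} {t.tensor_type.name:8s} {t.n_bytes / 2**20:8.2f} MiB")
```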

Yes, I noticed that while I was converting and testing it, but output.weight is also normally reduced; that happened automatically when converting:

[   1/ 333]                        output.weight - [ 2560, 128256,     1,     1], type =    f16, converting to q6_K .. size =   626.25 MiB ->   256.86 MiB
[   2/ 333]                    token_embd.weight - [ 2560, 128256,     1,     1], type =    f16, converting to iq4_nl .. size =   626.25 MiB ->   176.13 MiB
[...]
llama_model_quantize_internal: model size  =  1751.06 MB
llama_model_quantize_internal: quant size  =   934.16 MB
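
The per-tensor numbers in that log line up with the block layouts I believe llama.cpp uses for these types; a quick sanity check:

```python
MiB = 1024 * 1024
n = 2560 * 128256                    # elements in token_embd.weight / output.weight

f16    = n * 2                       # F16: 2 bytes per weight
q6_k   = n // 256 * 210              # Q6_K: 210-byte super-block of 256 weights (6.5625 bpw)
iq4_nl = n // 32 * 18                # IQ4_NL: 18-byte block of 32 weights (4.5 bpw)

print(f"f16    {f16 / MiB:7.2f} MiB")     # 626.25 MiB
print(f"q6_K   {q6_k / MiB:7.2f} MiB")    # 256.86 MiB
print(f"iq4_nl {iq4_nl / MiB:7.2f} MiB")  # 176.13 MiB
```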

Edit: It seems these two tensors (output.weight and token_embd.weight) are identical.
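
(Sketch of how one could verify that on the unquantized F16 GGUF, again with the `gguf` package and a placeholder file name:)

```python
import numpy as np
from gguf import GGUFReader

reader = GGUFReader("ggml-model-f16.gguf")   # placeholder: the F16 source model
tensors = {t.name: t for t in reader.tensors}

# Both tensors are stored as F16 here, so a direct element-wise comparison is meaningful.
emb = tensors["token_embd.weight"].data
out = tensors["output.weight"].data
print(np.array_equal(emb, out))
```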

I uploaded versions for ik_llama.cpp here: https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF that are smaller due to the changes above.
