This is cooked

#2 · by supercharge19

This model is cooked, overcooked actually. Could you try dynamic quantization, or is it already the best possible quantization for this class (1.58-bit)?

Also, what is the best quantization for the 1B instruct model? Could you please also try 4-bit but dynamic, as suggested by Unsloth?

Technology Innovation Institute org • edited Dec 20, 2024

Hi @supercharge19
1.58-bit BitNet models are a completely separate type of model. As you can see from the performance section of the model card, we do not claim this is a SoTA model; it's an open model for research purposes. It was built on recent research around 1.58-bit models: https://huggingface.co/blog/1_58_llm_extreme_quantization / https://huggingface.co/papers/2402.17764 (more to come in the upcoming technical report). 1.58-bit models can be really exciting for the future, as they offer extreme compression rates coupled with the fact that multiplications would not be required to run them (except for the LM head). If we demonstrate in the future that we can get very competitive 1.58-bit models, it would create exciting opportunities, e.g. building specialized "almost-matmul-free" hardware (since the LM head would still require matrix multiplications).
Feel free to read more about it by reading over the resources shared above and: https://github.com/microsoft/BitNet
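To make the ternary idea concrete, here is a minimal sketch of absmean quantization to {-1, 0, 1}, as described in the BitNet b1.58 paper (the function name and shapes below are purely illustrative, not from any official codebase):

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    # Per-tensor scale = mean absolute value of the weights (absmean scheme
    # from the BitNet b1.58 paper); weights become {-1, 0, 1} plus one scale.
    scale = w.abs().mean()
    w_ternary = (w / (scale + eps)).round().clamp(-1, 1)
    return w_ternary, scale  # dequantize as w_ternary * scale

# Illustrative example on a random weight matrix
w = torch.randn(4096, 4096)
w_q, s = absmean_ternary_quantize(w)
print(w_q.unique())  # tensor([-1., 0., 1.])
```

With weights restricted to {-1, 0, 1}, the matmuls in the linear layers reduce to additions and subtractions, which is where the "almost-matmul-free" hardware angle comes from.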
You can of course quantize the original model (tiiuae/Falcon3-10B-Instruct) using mature methods such as 4-bit bitsandbytes or any other quantization scheme supported here: https://huggingface.co/docs/transformers/quantization/overview. The architecture is Llama-based, and this will give you much better performance than this model.
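For example, a minimal 4-bit NF4 bitsandbytes setup through transformers would look roughly like this (the config values below are common defaults, adjust as needed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/Falcon3-10B-Instruct"

# Standard 4-bit NF4 quantization with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```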

Thank you for responding.
I thought BitNet was trained from scratch, not in f16 or some other precision, but rather with 0s and 1s (or, if I'm recalling correctly, -1, 0, 1), so its generation quality did not suffer, as the network learned connections differently than if it had been trained in 16-bit and then had precision thrown away with 1.58-bit quants.
I'm not saying this model is like BitNet; I was just hoping that the dynamic quantization method suggested by Unsloth in their blog was used: https://unsloth.ai/blog/dynamic-4bit (I was hoping that most layers retained generation quality despite being quantized, or were left unquantized if quality would suffer).
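To be clear about what I mean, something roughly like this (a hand-wavy sketch with plain bitsandbytes, not Unsloth's actual implementation; the skipped module list is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "tiiuae/Falcon3-10B-Instruct"

# Quantize most layers to 4-bit, but keep "sensitive" modules in higher precision.
# Despite its name, llm_int8_skip_modules is also honored when load_in_4bit=True.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["lm_head"],  # example: leave the LM head unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```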

I am just wondering if it is even possible to quantize a model to 1-bit or even sub-1-bit quants while keeping the same or close-to-original quality. Or can we not go lower than 4-bit?

Technology Innovation Institute org

Thank you @supercharge19 !
BitNet is indeed trained from scratch, but the pre-training is done in bf16; the ternary-format quantization is applied post-training. For this model, we leverage the idea behind https://huggingface.co/blog/1_58_llm_extreme_quantization by performing 1.58-bit fine-tuning, i.e., we start from a bf16 checkpoint (in this case the Falcon3-10B-Instruct model) and perform 1.58-bit fine-tuning on top of it. I think dynamic quantization in the 1.58-bit (ternary) format remains an unexplored field for now. Regarding sub-4-bit quantization methods, from what I know there are methods like HQQ or AQLM, with which Falcon3-10B-Instruct would definitely be compatible.
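As a quick illustration, on-the-fly 2-bit HQQ quantization can be done through transformers roughly like this (requires the hqq package; nbits and group_size below are example values, not tuned):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "tiiuae/Falcon3-10B-Instruct"

# 2-bit HQQ quantization applied at load time (pip install hqq)
quant_config = HqqConfig(nbits=2, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```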
