Add F16 and BF16 quantization

#129
opened by andito (HF Staff)
No description provided.
ggml.ai org

The problem with adding BF16 is that currently we use convert_hf_to_gguf.py to convert the HF model into F16, then use llama-quantize to quantize it.

So the conversion would be safetensors --> F16 --> BF16, which adds no benefit to the output model.
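For context, the current flow looks roughly like this (a minimal sketch; the paths and the exact way app.py shells out to the tools are assumptions, only the two-step conversion itself is from the thread):

```python
import subprocess

# Hypothetical paths; the real Space derives these from the model repo name.
model_dir = "downloaded-hf-model"   # safetensors checkpoint from the Hub
f16_gguf = "model-f16.gguf"         # intermediate F16 conversion
out_gguf = "model-bf16.gguf"        # requested output type

# Step 1: convert the HF safetensors model to an F16 GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", model_dir,
     "--outtype", "f16", "--outfile", f16_gguf],
    check=True,
)

# Step 2: re-quantize the F16 GGUF to the requested type.
# Going F16 -> BF16 here cannot recover precision already lost in step 1.
subprocess.run(
    ["./llama-quantize", f16_gguf, out_gguf, "BF16"],
    check=True,
)
```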

What we should do here is also modify the code that runs convert_hf_to_gguf.py, so that it outputs a BF16 GGUF file directly.
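A minimal sketch of that change, assuming convert_hf_to_gguf.py's --outtype flag accepts bf16 (the function name and paths are illustrative, not the Space's actual code):

```python
import subprocess

def convert_direct(model_dir: str, outfile: str, outtype: str) -> None:
    """Convert an HF safetensors model straight to the requested GGUF type.

    For f16/bf16 this skips llama-quantize entirely, so the BF16 output is
    produced from the original weights rather than from an F16 intermediate.
    """
    subprocess.run(
        ["python", "convert_hf_to_gguf.py", model_dir,
         "--outtype", outtype, "--outfile", outfile],
        check=True,
    )

# Example: produce a BF16 GGUF directly from the safetensors checkpoint.
convert_direct("downloaded-hf-model", "model-bf16.gguf", "bf16")
```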

I would also like to see a direct conversion into 'Floating Point 16', especially now that I know how 'BrainFloat16' works.

TL;DR Basically, BF16 sucks.

Cannot merge
This branch has merge conflicts in the following files:
  • app.py
