Performance of the model

#1
by thefaheem - opened

I wonder if there is any performance drop in the GGML version of the model compared to the HF version?

You mean performance in terms of accuracy of answers? Or in terms of how fast it responds?

In terms of speed of response: it depends. HF models load onto the GPU, and a GPU is much faster for inference than a CPU, but you also need a lot of VRAM, and the model is very large.
GGML runs pretty well on CPU, and if you don't have a powerful GPU it is likely to be quicker. But if you do have a powerful GPU, then HF may be quicker, especially in 8-bit or 4-bit. 4-bit quantised GPU inference (GPTQ) is likely to be fastest of all.
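For example, here is a minimal sketch of loading an HF-format model in 8-bit on GPU with transformers and bitsandbytes. The model id, prompt, and generation settings are just placeholders, not anything specific to this repo:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder: any HF-format Llama repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # requires bitsandbytes and a CUDA GPU
    device_map="auto",   # place layers on the available GPU(s) automatically
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swap `load_in_8bit=True` for `load_in_4bit=True` (with a recent bitsandbytes) to cut VRAM further at some cost to quality.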

If you meant in terms of accuracy of answers: in theory there may be a small drop, as GGML is quantised. But if you use q5_0 or q5_1 it is likely to be impossible to notice.

If you look at the README for llama.cpp, there is a table showing perplexity benchmark results for 7B at float16 (like HF) versus the various 4-bit and 5-bit GGML quantisation methods.

If you use the q5_1 file the difference from float16 is tiny, less than 0.1%. So it is very unlikely you will notice any drop in accuracy in practical usage.
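If you want to try that in practice, here is a minimal sketch of running a q5_1 GGML file on CPU with llama-cpp-python. The file path, thread count, and prompt are just placeholders:

```python
from llama_cpp import Llama

# Placeholder path: point this at whichever q5_1 GGML file you downloaded
llm = Llama(model_path="./ggml-model-q5_1.bin", n_ctx=2048, n_threads=8)

output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])
```

Setting `n_threads` to your number of physical cores is usually a good starting point for CPU inference.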

Ahh... Thanks for the detailed response!

thefaheem changed discussion status to closed
