Inference is very slow (about 3 secs/token)
Great to have this model on HF! Inference is super slow, though, which makes it hard to do real-time experiments. Can this be sped up easily?
As measured on Windows 11, CPU: i9-13900KF, 128 GB RAM, GPU: RTX 3090 (24 GB).
Use a quant... which doesn't exist yet.
@rfernand
Your best bet is quantization: it should boost speed by a large amount and also take up much less VRAM. I'd use the GPTQ quant format and load it with Hugging Face transformers for good speed. Transformers is the simpler option, but something like ExLlamaV2 should get you the fastest speed.
https://huggingface.co/TheBloke/Orca-2-13B-GPTQ
Use the 8-bit one for maximum quality.
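In case it helps, here's a minimal sketch of loading that GPTQ repo with plain transformers (assumes `pip install transformers optimum auto-gptq` and a CUDA GPU; this uses the repo's default branch, so selecting an 8-bit revision is left out):

```python
# Minimal sketch: load a GPTQ quant directly with transformers.
# Assumes `pip install transformers optimum auto-gptq` and a CUDA GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Orca-2-13B-GPTQ"  # default branch; other quant sizes live on other revisions
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("AI is going to", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```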
heh yeah and now they do exist ;)
Thanks @YaTharThShaRma999 and @PsiPi .
This is great - I tried the 4-bit version (https://huggingface.co/TheBloke/Orca-2-13B-GGUF) with the following results:
- model loading: 4x faster
- inference: 12x faster
TLDR
- pip install ctransformers[cuda]
- python script for inference:
```python
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU.
# Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Orca-2-13B-GGUF",
    model_file="orca-2-13b.Q4_K_M.gguf",
    model_type="llama",
    gpu_layers=50,
)
print(llm("AI is going to"))
```
Yeah LoneStriker offers an excellent version as well
For inference, I get the following error:
`GLIBC_2.29' not found
Anyone know how to resolve this?
Specifically:

```
OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /local/home/user_name/anaconda3/envs/odi-ds/lib/python3.9/site-packages/ctransformers/lib/cuda/libctransformers.so)
```
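That error means the system glibc is older than the 2.29 that the prebuilt `libctransformers.so` was compiled against. As a quick diagnostic, you can check the version Python sees (a small sketch; the printed value is just an example):

```python
# Print the glibc version the running Python sees.
# The prebuilt CUDA library above requires glibc >= 2.29.
import platform
print(platform.libc_ver())  # e.g. ('glibc', '2.28')
```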
Thank you for replying. I think I have the right glibc now, but every time I run the code in Jupyter my kernel dies as soon as I try to download the model from the repo.
wait nevermind the last comment, all good