very slow inference speed
Has anyone tried this model, TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GPTQ, on a local GPU? I tried the HF transformers example from the model card (roughly as sketched below) with the gptq-3bit-128g-actorder_True variant on a single RTX 3090: CUDA 12.3, torch 2.1.2+cu121, auto-gptq 0.7.1, optimum 1.17.1, transformers 4.38.2.
It took more than 10 minutes to produce the text (2133 chars / 384 words, including the prompt). The whole time, VRAM consumption was about 22 GB and GPU load was constantly around 90%. Isn't that a bit too slow?
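A minimal sketch of that loading code, following the usual TheBloke model-card pattern; the prompt and sampling parameters here are placeholders, not the exact run described above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GPTQ"

# Load the 3-bit GPTQ branch; device_map="auto" places the weights on the 3090.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    revision="gptq-3bit-128g-actorder_True",
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Placeholder prompt and sampling settings.
prompt = "Write a short story about llamas."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```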
@tunggad Hugging Face transformers is not that fast, so GPTQ inference through it will be pretty slow. Use ExLlama or ExLlamaV2 for faster inference. I would actually recommend using a 3bpw EXL2 quant of Mixtral and loading it with ExLlamaV2; you will get much faster speeds (around 50 tokens per second?). Something like the sketch below.
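An untested sketch of the ExLlamaV2 route; the model directory and sampling settings are placeholders, and it assumes you have already downloaded a 3bpw EXL2 quant of the model locally:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Placeholder path to a local 3bpw EXL2 quant of Nous-Hermes-2-Mixtral-8x7B-DPO.
model_dir = "/models/Nous-Hermes-2-Mixtral-8x7B-DPO-3.0bpw-exl2"

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate the KV cache as layers load
model.load_autosplit(cache)               # split the model across available VRAM

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Placeholder sampling settings.
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.95

output = generator.generate_simple("Write a short story about llamas.", settings, num_tokens=512)
print(output)
```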