why TheBloke/guanaco-65B-GPTQ run slow even on 80GB GPU
the code i tried is given below...
even tried with langchain still it's too slow,,,,
can you tell me how to make this model work faster. ........
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse
model_name_or_path = "TheBloke/guanaco-65B-GPTQ"
model_basename = "Guanaco-65B-GPTQ-4bit.act-order"
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
model_basename=model_basename,
use_safetensors=True,
trust_remote_code=True,
device="cuda:0",
use_triton=use_triton,
quantize_config=None)
prompt = "who is the first president of US"
prompt_template=f'''### Instruction: {prompt}
Response:'''
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=512,
temperature=0.7,
top_p=0.95,
repetition_penalty=1.15
)
print(pipe(prompt_template)[0]['generated_text'])
Thankyou;
hlo bro, can you please tell me how to make it fast
Hey Bloke, can please give a clarification
It may be because the AutoGPTQ CUDA extension hasn't built. Show the full output you see when running the script.
Please show me the output of:
!python -c 'import torch ; print(torch.__version__) ; print(torch.cuda.is_available())'
!nvidia-smi
I can't see any obvious problems there. However you're using the latest development version of AutoGPTQ, which has not been as well tested.
Please use version 0.2.2:
!pip uninstall auto-gptq
!pip install auto-gptq==0.2.2
Run that, then run the following test script:
import torch
import autogptq_cuda
print(torch.__version__)
print(torch.cuda.is_available())
print(autogptq_cuda)
And show me output.
i think problem is with installation am ryt, can you tell me the crt procedure of installation of autogptq
Try building from source:
!git clone https://github.com/PanQiWei/AutoGPTQ
!cd AutoGPTQ && git checkout v0.2.1 && pip install .
Then test again
is it solved? 解决了吗 ?TheBloke 你本地实验的运行速度大概是多少?一秒几个token
it is solved