Hi bro, I am a newbie to QLoRA. I tried the code below and it raises an OSError. Can you tell me how to load and use this model in Python?
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/guanaco-65B-GPTQ")

model = AutoModelForCausalLM.from_pretrained("TheBloke/guanaco-65B-GPTQ")

OSError Traceback (most recent call last)
in <cell line: 5>()
3 tokenizer = AutoTokenizer.from_pretrained("TheBloke/guanaco-65B-GPTQ")
----> 5 model = AutoModelForCausalLM.from_pretrained("TheBloke/guanaco-65B-GPTQ")

1 frames
/usr/local/lib/python3.10/dist-packages/transformers/ in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
2553 )
2554 else:
-> 2555 raise EnvironmentError(
2556 f"{pretrained_model_name_or_path} does not appear to have a file named"
2557 f" {_add_variant(WEIGHTS_NAME, variant)}, {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME} or"

OSError: TheBloke/guanaco-65B-GPTQ does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

You can't load GPTQ models with regular transformers. You need AutoGPTQ.
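To illustrate why the OSError above appears, here is a minimal sketch of the kind of check the stock loader performs. The list of expected file names is copied from the error message. The helper `has_standard_weights` and the repo listing are hypothetical: only the `.safetensors` name is confirmed by a log elsewhere in this thread, and the real loader logic is more involved.

```python
# Checkpoint names the stock transformers loader looked for,
# copied from the OSError message above.
STANDARD_WEIGHT_FILES = {
    "pytorch_model.bin",
    "tf_model.h5",
    "model.ckpt",
    "flax_model.msgpack",
}


def has_standard_weights(repo_files):
    """Return True if any file in the listing has one of the expected checkpoint names."""
    names = {path.rsplit("/", 1)[-1] for path in repo_files}
    return bool(names & STANDARD_WEIGHT_FILES)


# Hypothetical listing for the GPTQ repo (illustration only).
gptq_repo = [
    "config.json",
    "tokenizer.model",
    "Guanaco-65B-GPTQ-4bit.act-order.safetensors",
]

print(has_standard_weights(gptq_repo))  # False
```

Because the quantized weights use a custom file name, the loader finds none of the names it expects and gives up before reading anything.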

pip install auto-gptq

Here is example code:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse

model_name_or_path = "TheBloke/guanaco-65B-GPTQ"
model_basename = "Guanaco-65B-GPTQ-4bit.act-order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

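model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,  # weights file name without the extension
    use_safetensors=True,           # the weights ship as a .safetensors file
    device="cuda:0",
    use_triton=use_triton,
)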

prompt = "Tell me about AI"
prompt_template=f'''### Instruction: {prompt}
### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

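# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
)

print(pipe(prompt_template)[0]["generated_text"])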


Thank you bro 😊😊

First of all, thanks a lot for your work!
I ran into an issue that is triggered directly by the following code:

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,

It first warns me:

WARNING 2023-07-03 22:36:45,587-1d: CUDA extension not installed.
WARNING 2023-07-03 22:36:58,012-1d: The safetensors archive passed at /home/mydir/.cache/huggingface/hub/models--TheBloke--guanaco-65B-GPTQ/snapshots/c1a31c76e7228a13bc542b25243b912f12e39c87/Guanaco-65B-GPTQ-4bit.act-order.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.

After printing a large amount of information about device_map, it raises the following error:

C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.

Error Message Summary:

FatalError: Access to an undefined portion of a memory object is detected by the operating system.
[TimeInfo: *** Aborted at 1688395081 (unix time) try "date -d @1688395081" if you are using GNU date ***]
[SignalInfo: *** SIGBUS (@0x7fbce9c3dff0) received by PID 424101 (TID 0x7fbea6e7e740) from PID 18446744073336512496 ***]


I'm pretty sure that I have cudatoolkit installed. Do you have any clue about the problem?
Again, thanks for your work. I hope to hear back from you.

Firstly, just to check: are you running this on a system with an NVIDIA GPU available, with at least 48GB VRAM?
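As a rough sanity check on that figure, here is a sketch that estimates the raw size of the 4-bit weights and reports the VRAM on the first GPU. It assumes roughly 65e9 parameters and ignores quantization metadata, the KV cache and activations, which is why the practical requirement sits well above the raw number. The helper names are hypothetical, and the GPU query returns None if torch or CUDA is not available.

```python
def estimate_weight_gb(n_params, bits_per_weight=4):
    """Raw size of the quantized weights in GB (10**9 bytes), ignoring all overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def total_vram_gb():
    """Total memory of CUDA device 0 in GB, or None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


weights_gb = estimate_weight_gb(65e9)  # assumed parameter count
print(f"~{weights_gb:.1f} GB for the weights alone")

vram = total_vram_gb()
if vram is None:
    print("No CUDA device visible to torch (or torch is not installed)")
else:
    print(f"GPU 0 has {vram:.1f} GB of VRAM")
```

If the second line reports no CUDA device, that needs fixing before anything else in this thread will help.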

If so, the first problem is that the CUDA extension is not installed. Please try re-installing auto-gptq with:

pip3 uninstall -y auto-gptq
GITHUB_ACTIONS=true pip3 install auto-gptq

Not sure about the rest, let's see if installing AutoGPTQ with the CUDA module available fixes that first.
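After reinstalling, one way to check whether the compiled extension is actually importable is sketched below. The module names in `CANDIDATE_EXTENSIONS` are an assumption: the name of the compiled module has differed between AutoGPTQ releases, so adjust the tuple to match your installed version. The helper `find_cuda_extensions` is hypothetical and only looks for the module on the import path without loading it.

```python
import importlib.util

# Assumed names of the compiled CUDA extension; these vary by AutoGPTQ release.
CANDIDATE_EXTENSIONS = ("autogptq_cuda", "autogptq_cuda_64", "autogptq_cuda_256")


def find_cuda_extensions(candidates=CANDIDATE_EXTENSIONS):
    """Return the candidate extension modules that can be found on the import path."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


found = find_cuda_extensions()
if found:
    print("CUDA extension module(s) found:", ", ".join(found))
else:
    print("No CUDA extension module found; the 'CUDA extension not installed' warning is expected")
```

If nothing is found after the reinstall, the wheel pip picked up was probably built without CUDA support for your Python and CUDA versions.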

Thank you so much

