Bug: Generate method doesn't work for falcon-7b and falcon-40b in int8 mode.
#22 by avacaondata
System Info
- transformers version: 4.30.0.dev0
- Platform: Linux-5.15.0-72-generic-x86_64-with-glibc2.35
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
@ArthurZucker @younes
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:
- Import modules and load the model (note that the tokenizer also has to be instantiated here, since it is used in the next step; a quick sanity check follows the snippet):
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer

model_path = "tiiuae/falcon-40b"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, trust_remote_code=True, load_in_8bit=True, device_map="auto"
)
model.eval()
model.config.eos_token_id = 0
model.config.forced_eos_token_id = 0
model.config.pad_token_id = 0
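To confirm that the 8-bit quantization actually took effect, one can check for bitsandbytes int8 linear layers (a minimal sketch; it assumes bitsandbytes is installed, which load_in_8bit requires anyway):

import bitsandbytes as bnb

# True if at least one linear layer was replaced by an int8 module
print(any(isinstance(m, bnb.nn.Linear8bitLt) for m in model.modules()))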
- Tokenize a text:
text = "Hola qué tal estás Íñigo? ¿Qué vas a hacer hoy?"
inpts = tokenizer(text, return_tensors="pt").to("cuda")
- Try to generate text:
out = model.generate(**{k: v for k, v in inpts.items() if "token_type" not in k})
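The dict comprehension is only there to drop token_type_ids, which the tokenizer returns but which the remote-code forward() apparently does not accept. An equivalent, slightly more explicit form (a sketch, assuming that is the only key that needs filtering):

inpts.pop("token_type_ids", None)  # returned by the tokenizer, rejected by the model
out = model.generate(**inpts)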
Either way, you will receive the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[13], line 1
----> 1 out = model.generate(**{k: v for k, v in inpts.items() if "token_type" not in k})
File ~/miniconda3/envs/int4/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/miniconda3/envs/int4/lib/python3.9/site-packages/transformers/generation/utils.py:1518, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
1512 raise ValueError(
1513 "num_return_sequences has to be 1 when doing greedy search, "
1514 f"but is {generation_config.num_return_sequences}."
1515 )
1517 # 11. run greedy search
-> 1518 return self.greedy_search(
1519 input_ids,
1520 logits_processor=logits_processor,
1521 stopping_criteria=stopping_criteria,
1522 pad_token_id=generation_config.pad_token_id,
1523 eos_token_id=generation_config.eos_token_id,
1524 output_scores=generation_config.output_scores,
1525 return_dict_in_generate=generation_config.return_dict_in_generate,
...
291 )
293 x = attn_output.view(batch_size, self.num_heads, q_length, self.head_dim)
294 x = x.permute(0, 2, 1, 3)
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
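The error message appears to come from torch.nn.functional.scaled_dot_product_attention, which the Falcon remote attention code calls and which requires query, key, and value to share one dtype; under load_in_8bit the query/key tensors arrive as float32 while value is float16. A minimal sketch that reproduces the same check in isolation (assumption: the dtype mismatch alone is what triggers it):

import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 4, 64)           # float32
k = torch.randn(1, 8, 4, 64)           # float32
v = torch.randn(1, 8, 4, 64).half()    # float16
# Raises RuntimeError: Expected query, key, and value to have the same dtype ...
F.scaled_dot_product_attention(q, k, v)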
Expected behavior
The falcon-40b model should be able to generate in int8 as well; otherwise we cannot run inference even on an 80 GB A100. Other models have no problem with 8-bit inference.
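Until this is fixed upstream, one possible workaround (untested here, and based on the assumption that the float32 tensors come from non-quantized modules kept in full precision) is to request fp16 explicitly for the non-quantized weights when loading:

import torch
from transformers import AutoModelForCausalLM

# Hypothetical workaround: keep the non-quantized modules in fp16 so all
# attention inputs share one dtype.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)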