BLOOM models don't run on my GPU

#114
by TornButter - opened

The following code runs successfully on my CPU, maxing out a few cores while my 3090's usage stays at 0%:
import torch
from transformers import BloomTokenizerFast, BloomForCausalLM
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m")
prompt = "Dave picked up the baseball and"
result_length = 100
inputs = tokenizer(prompt, return_tensors="pt")
raw = model.generate(inputs["input_ids"], max_length=result_length)[0]
print(tokenizer.decode(raw))

However, I want to use my GPU. I have tried different models, such as 1b7, with the same result. With accelerate and device_map="auto", torch_dtype="auto", the models do run on my GPU, but generation then fails with "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!". When I instead add device = torch.device("cuda:0") and append .to(device) to the model line, I get the same runtime error, and likewise with .cuda(). I made sure to configure accelerate not to use my CPU. What am I doing wrong?
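
The error means the model's weights ended up on cuda:0 while the tokenized inputs are still CPU tensors. Here is a minimal sketch of the device_map="auto" path described above (assuming transformers with accelerate installed; this variant is my own, not from the posts): moving the inputs to the model's device before generating avoids the mismatch.

from transformers import BloomTokenizerFast, BloomForCausalLM

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
# accelerate places the weights for us (GPU first, spilling to CPU if needed)
model = BloomForCausalLM.from_pretrained(
    "bigscience/bloom-560m", device_map="auto", torch_dtype="auto"
)
inputs = tokenizer("Dave picked up the baseball and", return_tensors="pt")
# model.device is where the first weights live (cuda:0 here); without this
# .to() call, generate() raises the cuda:0 / cpu mismatch quoted above
inputs = inputs.to(model.device)
raw = model.generate(inputs["input_ids"], max_length=100)[0]
print(tokenizer.decode(raw))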

I found the solution. The model was on the GPU but the tokenized inputs were still on the CPU, so I added these two lines before the generate call and ran it on inputs2 instead:

device = torch.device("cuda:0")
inputs2 = inputs.to(device)

Here is the complete working code:
import torch
from transformers import BloomTokenizerFast, BloomForCausalLM

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m").cuda()  # weights on the GPU
prompt = "Dave picked up the baseball and"
result_length = 100
inputs = tokenizer(prompt, return_tensors="pt")
device = torch.device("cuda:0")
inputs2 = inputs.to(device)  # inputs on the same device as the model
raw = model.generate(inputs2["input_ids"], max_length=result_length)[0]
print(tokenizer.decode(raw))
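
A slightly more portable variant (a sketch of my own, not the code above) picks the device at runtime, so the same script falls back to the CPU on a machine without a GPU:

import torch
from transformers import BloomTokenizerFast, BloomForCausalLM

device = "cuda:0" if torch.cuda.is_available() else "cpu"
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m").to(device)
inputs = tokenizer("Dave picked up the baseball and", return_tensors="pt").to(device)
raw = model.generate(inputs["input_ids"], max_length=100)[0]
print(tokenizer.decode(raw))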

TornButter changed discussion status to closed
