Hardware requirements for inference?
Where can I find the hardware requirements for this model? Specifically, can it run on a 3060 with 12GB?
Theoretically, GPT-JT cannot run on a single 3060 12GB, as the model weights alone take up ~12GB (6B parameters x 2 bytes in fp16), so there is not enough memory left for inference. I'd recommend VRAM >= 16GB. An alternative is to use multiple 3060 GPUs with accelerate:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

# Load the model on CPU first; use fp16 weights so they match the fp16 memory estimate below
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-JT-6B-v1", torch_dtype=torch.float16
)

# Balance the weights across all visible GPUs without splitting individual GPT-J blocks
max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTJBlock"],
    dtype='float16',
    low_zero=False,
)
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTJBlock"],
    dtype='float16'
)

# Move each block to its assigned GPU
model = dispatch_model(model, device_map=device_map)
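Once the model is dispatched, generation works the same as on a single GPU. A quick sketch (the prompt and max_new_tokens value are arbitrary placeholders, and the inputs are assumed to go to the first GPU in the device map):

inputs = tokenizer("The capital of France is", return_tensors="pt")
input_ids = inputs.input_ids.to("cuda:0")  # the first layers usually land on cuda:0
output = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))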
I'm using this code and inference still takes ~12 seconds. I use 2x NVIDIA T4. For inference I call model.generate; do you know if I need to do anything else to make it use the GPU?
Do you have a code snippet with an inference example that uses the GPU? :) That would be awesome.
Thanks for the good work!
@billy-ai
Sorry for the late reply. If you use this code, the inference should run on GPU.
How many tokens were you trying to generate? It can be slow if max_new_tokens is large.
If you use a T4 with 16GB VRAM, simply moving the model to GPU with model = model.half().to('cuda:0') and calling output = model.generate(input_ids, max_new_tokens=10) is enough to run on GPU.
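For reference, here is a minimal end-to-end single-GPU sketch along those lines (the prompt and max_new_tokens values are arbitrary placeholders):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
# Load weights directly in fp16 so they fit in 16GB of VRAM
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-JT-6B-v1", torch_dtype=torch.float16
).to("cuda:0")

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))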
If I only have a 3070 with only 8GB of VRAM but a lot of regular RAM (46GB), can I get away with running it on the CPU instead? I don't mind if it's much slower.
Sure, you can run it on CPU without any problem. You can also try quantization: model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-JT-6B-v1', device_map='auto', load_in_8bit=True, int8_threshold=6.0)
:)
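If you go the CPU route, a minimal sketch would look like this (slow, but it only needs system RAM; the prompt is a placeholder):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
# fp32 weights for a 6B model take roughly 24GB of RAM, so 46GB is plenty
model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))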
Thanks! Sadly, won't be able to get another GPU soon!