How much VRAM + RAM does the 30B need? I have a 3060 12GB + 32GB RAM.
bump
It needs 24GB VRAM to load entirely on the GPU. You can try using text-generation-webui's pre_layer feature to load some layers on the GPU and some on the CPU; try pre_layer 30 as a starting figure.
I can't remember if pre_layer is the number of layers on the CPU or the number on the GPU. I think it means the number of layers on the GPU, so if you get out-of-memory with 30, try decreasing it to 20.
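For example, roughly (just a sketch reusing flags that appear later in this thread; adjust the model folder name and the layer count to your setup):
python server.py --model TheBloke_WizardLM-30B-Uncensored-GPTQ --wbits 4 --groupsize -1 --model_type LLaMA --pre_layer 30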
The HF fp16 version requires about 63GB of VRAM. The GPTQ 4-bit 128-group-size version needs about 25GB; the GPTQ 4-bit 1024-group-size version just fits in a 24GB card, but ooba has trouble dealing with a 1024 group size.
The 30B and above versions of LLaMA are pretty unapproachable for commodity devices at this moment.
This version used no group size, so it will definitely fit in 24GB. I stopped doing 1024 group size because the 30B will OOM with long responses. Group size none is reliable in 24GB, though.
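As a rough back-of-envelope for those numbers (a sketch, assuming ~32.5B parameters for the LLaMA "30B"): fp16 is 2 bytes per parameter and 4-bit is 0.5 bytes per parameter, before quantization metadata, context/KV cache and runtime overhead:
python -c "p = 32.5e9; print(f'fp16 weights: ~{p*2/1e9:.0f} GB, 4-bit weights: ~{p*0.5/1e9:.0f} GB')"
which is why fp16 lands in the 60+GB range while a 4-bit quant can fit on a 24GB card.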
I have a 3090 and still getting error about memory when loading it up.
Then that must be something else. It loads fine on a 24GB 4090 for me, testing with the ooba GPTQ-for-LLaMA CUDA fork.
Loading the model uses around 18GB VRAM, and then this grows as the response comes back, up to a maximum of 2000 tokens which uses 24203 MiB, leaving 13 MiB free :)
Output generated in 249.28 seconds (8.02 tokens/s, 1999 tokens, context 42, seed 953298877)
timestamp, name, driver_version, pcie.link.gen.max, pcie.link.gen.current, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/05/22 19:40:04.420, NVIDIA GeForce RTX 4090, 525.105.17, 4, 4, 20 %, 16 %, 24564 MiB, 13 MiB, 24203 MiB
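(That CSV comes from an nvidia-smi query; something like the following should reproduce it, though the 1-second polling interval here is just an example:)
nvidia-smi --query-gpu=timestamp,name,driver_version,pcie.link.gen.max,pcie.link.gen.current,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1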
I think their problem is normal RAM. For me, with my 4090, it first loads entirely into normal RAM, which maxed out my 32GB, and then shifts to the GPU. So if they don't have enough system RAM, I don't think it even tries to send it to the GPU.
Ah yes, that could be it. You generally always need at least as much RAM as you have VRAM.
If using ooba, you need a lot of RAM just to load the model (or pagefile if you don't have enough RAM); for 65B models I need something like 140+GB between RAM and pagefile.
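On Linux you can check how much RAM and swap you actually have with, for example:
free -h        # total/used/free RAM and swap
swapon --show  # active swap devices/files and their sizes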
Interesting, I have 32GB of RAM, 31.7 usable.
Here is the error stack:
Traceback (most recent call last):
  File "E:\oobabooga_windows\text-generation-webui\server.py", line 67, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\oobabooga_windows\text-generation-webui\modules\models.py", line 159, in load_model
    model = load_quantized(model_name)
  File "E:\oobabooga_windows\text-generation-webui\modules\GPTQ_loader.py", line 178, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "E:\oobabooga_windows\text-generation-webui\modules\GPTQ_loader.py", line 52, in _load_quant
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=shared.args.trust_remote_code)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 411, in from_config
    return model_class._from_config(config, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1146, in _from_config
    model = cls(config, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in __init__
    self.model = LlamaModel(config)
  File "E:\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 21, in __init__
    super().__init__(config)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 445, in __init__
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 445, in <listcomp>
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 256, in __init__
    self.mlp = LlamaMLP(
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 151, in __init__
    self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 238551040 bytes.
These 30B models can take over 64GB of your system RAM, which is why you need that extra pagefile/swap space. Does everyone just kill their X server and plug into the motherboard's HDMI to get their card's VRAM free? Any tricks would be welcome :)
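One trick on Linux with systemd (just a sketch; your display manager and default target may differ) is to drop to a text console, run inference, then bring the desktop back:
sudo systemctl isolate multi-user.target   # stop the graphical session, freeing its VRAM
# ... run inference ...
sudo systemctl isolate graphical.target    # restore the desktop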
Is there any existing framework that allows offloading even for GPTQ models? In principle, this should be doable.
Yes, check my first response in this thread: pre_layer in GPTQ-for-LLaMa supports offloading. This is supported in the text-generation-webui UI.
Set pre_layer to the number of layers to put on the GPU. There are 60 layers in total in this model, so e.g. on a 16GB card you could try --pre_layer 35 to put 35 layers on the GPU and the rest on the CPU. It will be really slow though. If you don't have enough VRAM to fully load the model, I recommend trying a GGML model instead and loading as many layers onto the GPU as you can, e.g. -ngl 50 to put 50 layers on the GPU (which fits in 16GB VRAM).
With GPTQ, the GPU needs enough VRAM to fit both the model and the context. With GGML and llama.cpp, GPU offloading stores the model but not the context, so you can fit more layers in a given amount of VRAM.
Generally GPTQ is faster than GGML if you have enough VRAM to fully load the model. But if you don't, GGML is now faster, and it can be much faster. E.g. testing this 30B model yesterday on a 16GB A4000 GPU, I got less than 1 token/s with --pre_layer 38, but 4.5 tokens/s with GGML and llama.cpp with -ngl 50.
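For the GGML route, a minimal llama.cpp invocation might look like this (a sketch assuming a CUDA-enabled build; the GGML filename is illustrative, use whichever quantized file you downloaded):
./main -m models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin -ngl 50 -c 2048 -n 256 -p "Your prompt here"   # -ngl 50 offloads 50 layers to the GPU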
Regarding multi-GPU with GPTQ: in recent versions of text-generation-webui you can also use pre_layer for multi-GPU splitting, e.g. --pre_layer 30 30 to put 30 layers on each of two GPUs.
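So a two-GPU launch might look roughly like this (same flags as the earlier sketch, just with two pre_layer values):
python server.py --model TheBloke_WizardLM-30B-Uncensored-GPTQ --wbits 4 --groupsize -1 --model_type LLaMA --pre_layer 30 30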
Regarding VRAM capacity, remember that if your card is also driving your primary display, it will not have the full 24GB available for the model, because some portion (~300-500MB) will be used by the OS for the display.
Obviously, if you run headless, or with two video cards, there should be no issues.
This model takes up about 18GB of VRAM on my 3090. I have auto-devices disabled in Ooba. It fits comfortably on the GPU with some room to spare. System RAM has nothing to do with it (I have 32GB of that).
If you're getting OOM errors on a 24GB card, you're probably running some other GPU-intensive program at the same time; otherwise I have no explanation.
What settings are you using to load the model? I have the same rig as you and it keeps crashing
With the latest text-gen-webui you really don't have to do anything: AutoGPTQ is used automatically, and unless you specify --triton it'll default to CUDA.
So probably:
python server.py --wbits 4 --groupsize -1 --model_type LLaMA --model
Pretty much, but I don't even specify groupsize; I just made sure that auto-devices is unchecked in the UI.
I tried to follow the suggestions you made, but am not sure what I'm still doing wrong. I encounter this error every time:
INFO:Loading TheBloke_WizardLM-30B-Uncensored-GPTQ...
INFO:The AutoGPTQ params are: {'model_basename': 'WizardLM-30B-Uncensored-GPTQ-4bit.act-order', 'device': 'cuda:0', 'use_triton': False, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None}
WARNING:The safetensors archive passed at models\TheBloke_WizardLM-30B-Uncensored-GPTQ\WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.
Press any key to continue . .
Just adding another data point RE: not enough system RAM
I had a similar issue with my setup, where I have more than enough VRAM but wasn't able to load the model because text-gen-webui kept running out of system memory (RAM). For me, I just had to increase my virtual memory (swap if you're on Linux), and that fixed things. Also, watching the RAM and VRAM usage while the model loads, I observed that it first loads the model (or more likely part of it) into RAM (and swap, because there wasn't enough RAM) and only then loads it into VRAM.
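For reference, adding a 64GB swapfile on Linux looks roughly like this (on Windows, increase the pagefile size in the system settings instead):
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile   # add an entry to /etc/fstab to make it permanent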
Which is the better option for running LLM models: 2x A6000, or 2x 3090 in SLI?