Out of memory on two 3090s
I tried loading the model with ExLlama on two 3090s but kept getting an out-of-memory error. When it crashes, the first GPU's VRAM is fully utilized (23.69 GB) while the second GPU has only used 7.87 GB.
$ python server.py --model TheBloke_guanaco-65B-GPTQ --listen --chat --loader exllama --gpu-split 24,24
bin /home/gameveloster/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
2023-06-18 15:57:03 INFO:Loading TheBloke_guanaco-65B-GPTQ...
Traceback (most recent call last):
  File "/mnt/md0/text-generation-webui/server.py", line 1014, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/mnt/md0/text-generation-webui/modules/models.py", line 65, in load_model
    output = load_func_map[loader](model_name)
  File "/mnt/md0/text-generation-webui/modules/models.py", line 277, in ExLlama_loader
    model, tokenizer = ExllamaModel.from_pretrained(model_name)
  File "/mnt/md0/text-generation-webui/modules/exllama.py", line 41, in from_pretrained
    model = ExLlama(config)
  File "/mnt/md0/text-generation-webui/repositories/exllama/model.py", line 630, in __init__
    tensor = tensor.to(device, non_blocking = True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.69 GiB total capacity; 23.01 GiB already allocated; 35.12 MiB free; 23.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Should this model be loadable on two 3090s when using ExLlama?
I was running into this OOM issue even before ExLlama. Following this recommendation, --gpu-split 17.2,24, it now works perfectly and I am getting 12 tokens/s. Impressive!
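For reference, the full launch command with that split (same flags as in the original post, only the --gpu-split values changed) would be something like:

$ python server.py --model TheBloke_guanaco-65B-GPTQ --listen --chat --loader exllama --gpu-split 17.2,24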
import torch
from auto_gptq import AutoGPTQForCausalLM

use_triton = False

# model_name_or_path / model_basename point at the GPTQ checkpoint files
m = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_cuda_fp16=False,
    use_safetensors=True,
    trust_remote_code=True,
    device_map="auto",
    # allow up to 24 GB per GPU so the weights get sharded across both cards
    max_memory={i: "24000MB" for i in range(torch.cuda.device_count())},
    use_triton=use_triton,
    quantize_config=None,
)
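A minimal generation sketch for the model loaded above; the tokenizer call and the guanaco-style prompt are assumptions, not part of the original snippet:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
prompt = "### Human: Hello, how are you?\n### Assistant:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")  # first shard sits on GPU 0
output = m.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))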
You need to put less on GPU 1 to leave room for the context. Try 16 GB on GPU 1 and 24 GB on GPU 2 (e.g. --gpu-split 16,24).
Or you'll get much better performance with ExLlama, and lower GPU usage too. Here's example code using ExLlama (there are more examples in the same repo): https://github.com/turboderp/exllama/blob/c16cf49c3f19e887da31d671a713619c8626484e/example_basic.py
To that basic ExLlama code you would add config.set_auto_map("17.2,24") and config.gpu_peer_fix = True to split the model over two GPUs.
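Sketched against the linked example_basic.py (run from inside the exllama repo so the imports resolve; the model directory path is a placeholder), the two-GPU setup would look roughly like this:

import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/TheBloke_guanaco-65B-GPTQ/"  # placeholder

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)  # reads the model's config.json
config.model_path = model_path             # quantized .safetensors weights
config.set_auto_map("17.2,24")             # GB of weights to place on GPU 0 and GPU 1
config.gpu_peer_fix = True                 # move tensors via system RAM between GPUs

model = ExLlama(config)                    # weights are loaded and split here
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("### Human: Hello\n### Assistant:", max_new_tokens=64))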