Out of memory on two 3090s
I tried loading the model with ExLlama on two 3090s but kept getting an out-of-memory error. When it crashes, the first GPU's VRAM is fully utilized (23.69 GB) while the second GPU has only used 7.87 GB.
$ python server.py --model TheBloke_guanaco-65B-GPTQ --listen --chat --loader exllama --gpu-split 24,24
bin /home/gameveloster/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
2023-06-18 15:57:03 INFO:Loading TheBloke_guanaco-65B-GPTQ...
Traceback (most recent call last):
  File "/mnt/md0/text-generation-webui/server.py", line 1014, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/mnt/md0/text-generation-webui/modules/models.py", line 65, in load_model
    output = load_func_map[loader](model_name)
  File "/mnt/md0/text-generation-webui/modules/models.py", line 277, in ExLlama_loader
    model, tokenizer = ExllamaModel.from_pretrained(model_name)
  File "/mnt/md0/text-generation-webui/modules/exllama.py", line 41, in from_pretrained
    model = ExLlama(config)
  File "/mnt/md0/text-generation-webui/repositories/exllama/model.py", line 630, in __init__
    tensor = tensor.to(device, non_blocking = True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.69 GiB total capacity; 23.01 GiB already allocated; 35.12 MiB free; 23.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Should this model be loadable on two 3090s when using ExLlama?
I was running into this OOM issue even before ExLlama. Following this recommendation, --gpu-split 17.2,24, it now works perfectly and I am getting 12 tokens/s. Impressive!
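For reference, the full launch command with that split (same flags as in the original post, only the --gpu-split values changed) would be something like:

$ python server.py --model TheBloke_guanaco-65B-GPTQ --listen --chat --loader exllama --gpu-split 17.2,24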
import torch
from auto_gptq import AutoGPTQForCausalLM

use_triton = False

# model_name_or_path / model_basename point at the GPTQ checkpoint files
m = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_cuda_fp16=False,
    use_safetensors=True,
    trust_remote_code=True,
    device_map="auto",
    # allow up to 24 GB per GPU so the weights get sharded across both cards
    max_memory={i: "24000MB" for i in range(torch.cuda.device_count())},
    use_triton=use_triton,
    quantize_config=None,
)
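A minimal generation sketch for the model loaded above; the tokenizer call and the guanaco-style prompt are assumptions, not part of the original snippet:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
prompt = "### Human: Hello, how are you?\n### Assistant:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")  # first shard sits on GPU 0
output = m.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))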
You need to put less on GPU 1 to leave room for the context. Try 16 GB on GPU 1 and 24 GB on GPU 2 (e.g. --gpu-split 16,24).
Or you'll get much better performance with ExLlama, and lower GPU usage too. Here's example code using ExLlama (there are more examples in the same repo): https://github.com/turboderp/exllama/blob/c16cf49c3f19e887da31d671a713619c8626484e/example_basic.py
To that basic ExLlama code you would add config.set_auto_map("17.2,24") and config.gpu_peer_fix = True to split the model over two GPUs.
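Sketched against the linked example_basic.py (run from inside the exllama repo so the imports resolve; the model directory path is a placeholder), the two-GPU setup would look roughly like this:

import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/TheBloke_guanaco-65B-GPTQ/"  # placeholder

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)  # reads the model's config.json
config.model_path = model_path             # quantized .safetensors weights
config.set_auto_map("17.2,24")             # GB of weights to place on GPU 0 and GPU 1
config.gpu_peer_fix = True                 # move tensors via system RAM between GPUs

model = ExLlama(config)                    # weights are loaded and split here
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("### Human: Hello\n### Assistant:", max_new_tokens=64))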