torch.cuda.OutOfMemoryError: CUDA out of memory
Yep, apparently an RTX 4080 16GB and 32 GB of system RAM won't cut it.
"torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacty of 15.99 GiB of which 0 bytes is free. Of the allocated memory 30.20 GiB is allocated by PyTorch, and 56.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
Hi there, can you share more information so that I can assist you?
- Which version of the model are you trying to run (this one, or some quant)?
- How are you running the model (what program and command are you using)?
This is a 70B-parameter model, so it's not possible to run it in full precision on any single graphics card (not even an H100 :)). The AWQ quant dreamgen/opus-v0-70b-awq takes at least 35 GB of VRAM (I run it on a 48 GB card).
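Back-of-the-envelope, counting weights only (the KV cache, activations, and framework overhead add more on top), the numbers look roughly like this:

```python
# Rough weight-only memory estimate; real usage is higher because of the
# KV cache, activations, and framework overhead.
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB

print(weights_gb(70, 2.0))  # 70B in fp16      -> ~140 GB
print(weights_gb(70, 0.5))  # 70B in 4-bit AWQ -> ~35 GB
print(weights_gb(7, 2.0))   # 7B in fp16       -> ~14 GB
```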
You could try a few things:
- Run the 7B model instead, dreamgen/opus-v0-7b -- you can also try the GGUF Q8 and Q6 quants, which are not bad (the smaller ones are quite bad); see the sketch after this list
- Run the 70B quantized model on CPU -- this is going to be very slow, and I have not tried this myself
- Rent a GPU in the cloud
- Try it on dreamgen.com -- you can try both the 7B and the 70B AWQ versions
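For the first option, here is a minimal sketch of loading dreamgen/opus-v0-7b in fp16 with transformers (this assumes transformers and accelerate are installed; the weights are roughly 14 GB, so it should just about fit in 16 GB, and device_map="auto" will spill to system RAM if it doesn't):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dreamgen/opus-v0-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # ~14 GB of weights
    device_map="auto",          # offloads to CPU RAM if the GPU fills up (needs accelerate)
)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```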
I tried the v0-70b-awq version with text-generation-webui. I can't use the dreamgen service either, since I can't comply with their data collection policy, as my work is protected.
Thanks for the details. The 70B AWQ requires at least 35 GB of VRAM, so you can't run it on your GPU. You could try the smallest quantized version from TheBloke: https://huggingface.co/TheBloke/opus-v0-70B-GGUF/blob/main/opus-v0-70b.Q2_K.gguf and run it with the llama.cpp backend on your CPU (optionally offloading some layers to the GPU), but expect it to be slow.
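If you go that route, a minimal sketch with llama-cpp-python might look like this (assumes the .gguf file has been downloaded locally; n_gpu_layers is a guess you would tune to what fits in your 16 GB):

```python
from llama_cpp import Llama

# Assumes opus-v0-70b.Q2_K.gguf was downloaded from TheBloke's repo.
llm = Llama(
    model_path="./opus-v0-70b.Q2_K.gguf",
    n_ctx=4096,       # context length
    n_gpu_layers=20,  # offload some layers to the 16 GB GPU; 0 = pure CPU
)

out = llm("Once upon a time", max_tokens=200)
print(out["choices"][0]["text"])
```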
Otherwise, you could try running it on a GPU cloud; I use RunPod. To run the 70B AWQ model, I use an A6000 with the vLLM backend.
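Roughly what that looks like with vLLM's Python API (a sketch only; exact arguments depend on the vLLM version, and AWQ wants an fp16 dtype):

```python
from vllm import LLM, SamplingParams

# 70B AWQ on a single 48 GB card (e.g. A6000); quantization="awq" loads the 4-bit weights.
llm = LLM(model="dreamgen/opus-v0-70b-awq", quantization="awq", dtype="half")

params = SamplingParams(temperature=0.8, max_tokens=200)
outputs = llm.generate(["Once upon a time"], params)
print(outputs[0].outputs[0].text)
```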
Thanks for the tips. I'll give it a go.