VRAM Requirements

#3
by innate - opened

I was wondering what the recommended amount of VRAM would be to run this model using KoboldAI or the oobabooga web-ui.
I've been using the image below as a reference but understand it's not exact:

image.png

I apologize if this question comes from ignorance of the inner workings of LLMs and AI in general, but I'd appreciate any help or info you could share to enlighten me.

I've been looking for a model to use for coding assistance and want to know if I'll be able to run this with my current local setup:
Ryzen 7 5800X3D
32GB RAM
2x 3070 8GB

I have 64GB of RAM and an RTX 4090 with 24GB of VRAM. Not sure what is going on (I am new to all this). Just trying to initialize the model (with the given config) overflows my system RAM, before it has even started loading the weights.

BTW I think your table is wrong. If you have 30B params, bfloat16 takes 2 bytes for each, so that will be 60GB. There is no way it will fit in 24GB of GPU memory, or 32GB of RAM. Maybe you are talking about 4-bit or 8-bit quantization?
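The arithmetic above can be written out as a small sketch. It counts the weights only (parameter count times bytes per parameter) and ignores activations, cache and framework overhead, so treat the numbers as a lower bound. The function name and the dtype table are illustrative, not taken from any library.

```python
# Weight-only memory estimate: parameter count x bytes per parameter.
# Real usage is higher (activations, cache, framework overhead).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the size of the weights alone in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for n_params in (15e9, 30e9):
        for dtype in BYTES_PER_PARAM:
            size = weight_memory_gb(n_params, dtype)
            print(f"{n_params / 1e9:.0f}B params in {dtype}: {size:.1f} GB")
```

For a 30B model in bfloat16 this gives 60 GB, and for a 15B model at 4 bits it gives 7.5 GB before any overhead.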

I am looking for someone to talk to about these things. Are there any other venues? Is there a live community?

Someone will correct me if I'm wrong, but if you look at the Files list pytorch_model.bin is 31GB. This must be loaded into VRAM. So even a 4090 can't run this as-is.

However, TheBloke quantizes models to 4-bit, which allows them to be loaded on consumer cards. His version of this model is ~9GB.
https://huggingface.co/TheBloke/WizardCoder-15B-1.0-GPTQ

You then load this model into the text-generation UI found here:
https://github.com/oobabooga/text-generation-webui

Aitrepeneur just did a review of TheBloke's quantized version:
https://youtu.be/XjsyHrmd3Xo

Thank you for your reply. The chart I posted does refer to 4-bit quantized models. I apologize for the confusion; I did not know that myself until you mentioned it, and I still need a better understanding of this stuff overall. I'm going to try TheBloke's 4-bit model and see how it performs. From Aitrepeneur's review it looks like it'll be good for my purposes right now. I'm also going to have to read up on some papers that discuss or explain quantization. Here's a good link if anyone else wants it - https://rentry.org/LocalModelsPapers.

And I too overflowed my system trying to run this model as is lol - I watched my screen colors fade from full VRAM usage, with 97% of my RAM used, before the system crashed.

Innate, you said you had 8GB of VRAM. TheBloke's version is 9GB. So, yes, unfortunately it won't fit. :)

I was able to load TheBloke's version by sharing the memory across my two GPUs using the text-generation-webui, since I have two 8GB 3070s in my PC.
•ᴗ•
image.png
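Why the ~9GB model fails on one 8GB card but loads across two can be checked with a quick sketch. The helper name and the 1 GB per-GPU reserve are assumptions for illustration; real loaders split by layer and need extra room for activations, so this is only a first sanity check.

```python
def fits_across_gpus(model_gb: float, gpu_gb: list[float], reserve_gb: float = 1.0) -> bool:
    """Return True if the weights could be split across the GPUs.

    reserve_gb is an assumed amount kept free on each card for
    activations and the CUDA context; tune it for your own setup.
    """
    usable = sum(max(g - reserve_gb, 0.0) for g in gpu_gb)
    return model_gb <= usable


if __name__ == "__main__":
    print("one 8GB card:", fits_across_gpus(9.0, [8.0]))
    print("two 8GB cards:", fits_across_gpus(9.0, [8.0, 8.0]))
```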

My 80G VRAM is full, but I still can't get a response
vram.png

My code:

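import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/data/cache_models/offline_cache/WizardCoder-15B-V1.0"

def init():
    # Load the tokenizer and model once and keep them as module-level globals.
    global tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, revision="main", local_files_only=True)
    global llm_model
    llm_model = AutoModelForCausalLM.from_pretrained(
        model_path,
        revision="main",
        device_map="sequential",  # fill the first GPU before moving on to the next
        torch_dtype=torch.bfloat16,  # 2 bytes per parameter
    )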

Does anyone really understand what is going on there?
I have hundreds of questions, and no idea whom to ask.

Why does it take more than 30GB (15x2)?
What are the exact computation requirements to generate one token?
What are the computation requirements to fine-tune this model?
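As a rough starting point for these questions, here is a sketch built on two widely quoted rules of thumb, neither of them measured on this model. A forward pass costs about 2 FLOPs per parameter per generated token, and full fine-tuning with Adam in mixed precision needs about 16 bytes per parameter (2 for weights, 2 for gradients, 12 for fp32 master weights and the two optimizer moments). Inference also needs memory beyond the weights for activations, the attention cache and the CUDA context, which is one reason usage exceeds the 15x2 figure. The function names are illustrative.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Approximate FLOPs to generate one token (about 2 per parameter)."""
    return 2.0 * n_params


def full_finetune_memory_gb(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Approximate memory for weights, gradients and Adam state, in GB.

    16 bytes per parameter is the usual mixed-precision Adam estimate;
    activation memory comes on top and depends on batch and sequence size.
    """
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    n = 15e9  # parameter count of a 15B model
    print(f"forward pass: {forward_flops_per_token(n):.1e} FLOPs per token")
    print(f"full fine-tune state: {full_finetune_memory_gb(n):.0f} GB")
```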

I am trying to understand all of this in detail by writing my own program to run any Hugging Face model (without using any of the Python modules provided). Maybe then I will understand.

I do not know the requirements to run this specific model by WizardLM. If it wasn't clear in my above post, I am using TheBloke's 4bit quantization version to reduce the VRAM required ---> https://huggingface.co/TheBloke/WizardCoder-15B-1.0-GPTQ
Model quantization is a method of reducing the size of a trained model while maintaining accuracy. It works by reducing the precision of the weights and activations used by a model without affecting (significantly) the overall accuracy.
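To make the idea concrete, here is a minimal sketch of naive round-to-nearest 4-bit quantization on a handful of weights. This is not the GPTQ algorithm used for TheBloke's model, which is more sophisticated; it only shows how precision is traded for size. The function names and the sample weights are made up for illustration.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [0, 15] using a shared scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    weights = [-0.8, -0.1, 0.0, 0.3, 0.7]  # made-up example values
    codes, scale, lo = quantize_4bit(weights)
    restored = dequantize_4bit(codes, scale, lo)
    for w, c, r in zip(weights, codes, restored):
        print(f"original {w:+.3f} -> code {c:2d} -> restored {r:+.3f}")
```

Each weight is stored as a 4-bit code plus a scale and offset shared by the group, and the rounding error per weight is at most half of one step (scale / 2).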
That's not to say the quantized model ran perfectly out of the box either. I had to read all of the text-generation-webui documentation to get it to load and work properly. I also read a few OpenAI and Meta papers to gain a better understanding of what goes on "under the hood"; I'll post the few I've read down below. Not trying to tell anyone what to do, but: learning is a journey, frustration is natural, and discipline is what will put you over the finish line and guide you towards better understanding. <3
https://arxiv.org/abs/2304.12210 - A Cookbook of Self-Supervised Learning
https://arxiv.org/abs/2205.01068 - OPT: Open Pre-trained Transformer Language Models
https://arxiv.org/abs/2305.20050 - Let's Verify Step by Step
If stuff in the papers is going over your head, don't worry. Google is your friend.

innate changed discussion status to closed
