RAM required for full 64k context?
Sorry if this is a dumb question, but I've heard that 360GB of RAM is required for the full 128k context of this 13B model. How much does it require for a full 64k context? Is the 10.37 GB requirement listed for the `Q4_K_M` quantized version still valid?
Good question. That RAM requirement is calculated statically from the size of the model file, so it doesn't account for context length. I have not measured RAM requirements at extended context.
I'll do a quick test now
OK, I tested Llama-2-7B-64K with `-c 65736` and an input file containing plain text which I generated with Transformers' `tokenizer.decode(..)` from 64K wikitext tokens. (I set `-c 65736` rather than 65536 to allow 200 tokens for the reply, not that I got that far.)
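For reference, the prompt file was produced with something like this. It's a rough sketch rather than the exact script; the dataset slice and model ID here are illustrative:

```python
# Rough sketch of building a ~64K-token plain-text prompt (illustrative, not the exact script).
# Assumes wikitext via the `datasets` library and a Llama-2 tokenizer via `transformers`.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Concatenate some wikitext rows, keep the first 64K tokens, decode back to plain text
rows = load_dataset("wikitext", "wikitext-103-raw-v1", split="train[:20000]")
text = "\n\n".join(rows["text"])
tokens = tokenizer.encode(text)[:64 * 1024]

with open("prompt-64k.txt", "w") as f:
    f.write(tokenizer.decode(tokens))
```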
RAM usage quickly went to 80GB, shortly after I started the command.
At about 20% through the prompt it was using 87GB RAM, i.e. growth of 7GB over the first 20% of prompt ingestion. At 30% it was at 96GB. At 45%, 107GB. At 51%, 115GB.
It was also using VRAM at the same time, and it eventually went OOM at 60% through the prompt. I didn't tell it to use the GPU - I had `-ngl 0`, so no layers were offloaded to VRAM - but I guess it was still using cuBLAS for prompt ingestion.
So I don't know the final RAM figure, and I don't know whether RAM usage would have been even higher without cuBLAS offloading prompt ingestion to the 48GB GPU.
But getting some very rough figures:
It had used an average of 3.5GB of additional RAM per 10% of the prompt by the 20% mark, 5.3GB per 10% by 30%, and 7GB per 10% by 50%. So if by 100% it were averaging 14GB per 10%, total RAM usage would be around 220GB (80GB + 14GB × 10) for 7B at 64k. Though maybe it'd be even higher than that.
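Spelling that out as a quick back-of-envelope (this just redoes the arithmetic above from the figures I observed, nothing more):

```python
# Extrapolation from the measured figures above (my own back-of-envelope, not llama.cpp output).
base_ram_gb = 80.0  # RAM shortly after startup

# (fraction of prompt ingested, total RAM observed in GB)
observations = [(0.20, 87.0), (0.30, 96.0), (0.45, 107.0), (0.51, 115.0)]

for frac, ram in observations:
    avg_per_10pct = (ram - base_ram_gb) / (frac * 10)
    print(f"{frac:.0%}: {ram:.0f} GB total, avg {avg_per_10pct:.1f} GB per 10% so far")

# If the average reaches ~14 GB per 10% by the end of the prompt:
print(f"Projected total at 100%: ~{base_ram_gb + 14.0 * 10:.0f} GB")
```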
I probably don't have those figures exactly right, but we can definitely see that RAM usage per token increases the further it gets through the prompt, which is expected without flash attention: RAM/VRAM grows quadratically as the context fills.
So yeah I could definitely believe 360GB for 13B at 128k context!
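For what it's worth, even the linear part alone (the KV cache) is already large at this context length. A rough estimate, assuming standard Llama-2-7B dimensions and an f16 KV cache; the quadratic attention scratch buffers come on top of this:

```python
# Back-of-envelope KV cache size for Llama-2-7B at 64K context (assumption: f16 KV cache).
n_layers = 32       # Llama-2-7B
n_embd = 4096       # hidden size (7B has no GQA, so K/V use the full hidden size)
n_ctx = 65536
bytes_per_elem = 2  # f16

# K and V each store n_ctx * n_embd values per layer
kv_bytes = 2 * n_layers * n_ctx * n_embd * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")  # ~32 GiB
```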
Thanks for the detailed info. So with about 24GB of RAM, I won't be able to load this (since your initial RAM usage was already 80GB)?
I think I'll stick with the original Llama 2 with 4k context for now.
Is there a 4k context version of Llama 2 in GGUF format, please?