RAM required for full 64k context?
Sorry if this is a dumb question, but I've heard that 360GB of RAM is required for the full 128k context of this 13B model. How much does it require for a full 64k context? Is the 10.37 GB requirement listed for the `Q4_K_M` quantized version still valid?
Good question. That RAM requirement is calculated statically from the size of the model file, so it doesn't account for context length. I have not measured RAM requirements at extended context.
I'll do a quick test now
OK, I tested Llama-2-7B-64K with `-c 65736` and an input file containing plain text which I generated with Transformers' `tokenizer.decode(..)` from 64K wikitext tokens. (I set `-c 65736` rather than 65536 to allow 200 tokens for the reply, not that I got that far.)
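For reference, the prompt file was produced with something like this. It's a rough sketch rather than the exact script; the dataset slice and model ID here are illustrative:

```python
# Rough sketch of building a ~64K-token plain-text prompt (illustrative, not the exact script).
# Assumes wikitext via the `datasets` library and a Llama-2 tokenizer via `transformers`.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Concatenate some wikitext rows, keep the first 64K tokens, decode back to plain text
rows = load_dataset("wikitext", "wikitext-103-raw-v1", split="train[:20000]")
text = "\n\n".join(rows["text"])
tokens = tokenizer.encode(text)[:64 * 1024]

with open("prompt-64k.txt", "w") as f:
    f.write(tokenizer.decode(tokens))
```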
RAM usage quickly went to 80GB, shortly after I started the command.
At about 20% through the prompt it was using 87GB RAM, i.e. growth of 7GB over the first 20% of prompt ingestion. At 30% it was at 96GB. At 45%, 107GB. At 51%, 115GB.
It was also using VRAM at the same time, and it eventually went OOM at 60% through the prompt. I didn't tell it to use the GPU - I had `-ngl 0`, so no layers were offloaded to VRAM - but I guess it was still using cuBLAS for prompt ingestion.
So I don't know the final RAM figure, and I don't know whether RAM usage would have been even higher without cuBLAS offloading prompt ingestion to the 48GB GPU.
But getting some very rough figures:
It had used an average of 3.5GB of additional RAM per 10% of the prompt by the 20% mark, 5.3GB per 10% by 30%, and 7GB per 10% by 50%. So if by 100% it were averaging 14GB per 10%, total RAM usage would be around 220GB (80GB + 14GB × 10) for 7B at 64k. Though maybe it'd be even higher than that.
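Spelling that out as a quick back-of-envelope (this just redoes the arithmetic above from the figures I observed, nothing more):

```python
# Extrapolation from the measured figures above (my own back-of-envelope, not llama.cpp output).
base_ram_gb = 80.0  # RAM shortly after startup

# (fraction of prompt ingested, total RAM observed in GB)
observations = [(0.20, 87.0), (0.30, 96.0), (0.45, 107.0), (0.51, 115.0)]

for frac, ram in observations:
    avg_per_10pct = (ram - base_ram_gb) / (frac * 10)
    print(f"{frac:.0%}: {ram:.0f} GB total, avg {avg_per_10pct:.1f} GB per 10% so far")

# If the average reaches ~14 GB per 10% by the end of the prompt:
print(f"Projected total at 100%: ~{base_ram_gb + 14.0 * 10:.0f} GB")
```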
I probably don't have those figures exactly right, but we can definitely see that RAM usage per token increases the further it gets through the prompt, which is expected without flash attention: RAM/VRAM grows quadratically as the context fills.
So yeah I could definitely believe 360GB for 13B at 128k context!
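For what it's worth, even the linear part alone (the KV cache) is already large at this context length. A rough estimate, assuming standard Llama-2-7B dimensions and an f16 KV cache; the quadratic attention scratch buffers come on top of this:

```python
# Back-of-envelope KV cache size for Llama-2-7B at 64K context (assumption: f16 KV cache).
n_layers = 32       # Llama-2-7B
n_embd = 4096       # hidden size (7B has no GQA, so K/V use the full hidden size)
n_ctx = 65536
bytes_per_elem = 2  # f16

# K and V each store n_ctx * n_embd values per layer
kv_bytes = 2 * n_layers * n_ctx * n_embd * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")  # ~32 GiB
```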
Thanks for the detailed info. So with about 24GB of RAM, I won't be able to load this (since your initial RAM usage was already 80GB)?
I think I'll stick with the original Llama 2 with 4k context for now.
Is there a 4k context version of Llama 2 in GGUF format, please?