High memory use
EXAONE uses a lot more memory for context than Qwen 2.5. Is this inherent to the model, or is it something wrong in llama.cpp?
Hi, electroglyph.
Could you give us more information (e.g., GGUF type and llama-cli parameters) for testing?
When we compared EXAONE-3.5-2.4B-Instruct-BF16.gguf and qwen2.5-3b-instruct-fp16.gguf with the same parameters (llama-cli -cnv -m '...' -p '...') on CPU, EXAONE used less memory.
I've tested with some of the GPU backends (e.g., SYCL and Vulkan), and my context limit is around 50% of what it is with Qwen 2.5 3B. I've tried several versions of llama.cpp so far. I'm going to do some more testing and will be back with more detailed information.
...my context limit is somewhere around 60K with EXAONE 2.4B, but I can hit 120K with Qwen 2.5 3B (no quantization). These small models are great for running in parallel, so my actual per-task context is the total divided by how many parallel tasks I'm running. The lower context limit means I have to run fewer tasks in parallel.
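As a minimal sketch of that tradeoff, assuming (as with llama-server's --parallel option) the total context is split evenly across slots, and using the rough ceilings above:

```python
# Sketch: how many parallel tasks fit under a fixed total-context ceiling,
# assuming the context is split evenly across slots. The 60K / 120K
# figures are the approximate ceilings observed above.
def slots_at_context(total_ctx: int, ctx_per_task: int) -> int:
    """How many parallel tasks fit if each one needs ctx_per_task tokens."""
    return total_ctx // ctx_per_task

for name, total in [("EXAONE 3.5 2.4B", 60_000), ("Qwen 2.5 3B", 120_000)]:
    print(name, slots_at_context(total, ctx_per_task=8_192))  # 7 vs. 14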
After more testing, I can update this to say the context limit is almost exactly 50% of Qwen 2.5 3B's.
I've opened an issue here if you want to weigh in:
https://github.com/ggerganov/llama.cpp/issues/10823
Hi, 0xDEADFED5.
It is due to architectural differences between EXAONE 3.5 2.4B and Qwen 2.5 3B. To be specific, num_attention_heads and num_key_value_heads differ between the two models, and the KV cache grows with the number of key/value heads, so EXAONE needs more memory per token of context.
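For a rough sense of scale, here is a minimal sketch of the f16 KV-cache size per token; the layer, KV-head, and head-dimension counts below are taken from the models' published configs and should be treated as assumptions rather than values verified against the GGUFs:

```python
# Hedged sketch: approximate f16 KV-cache bytes per token of context.
# Config numbers below are assumptions from the models' published configs.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # K and V each store n_kv_heads * head_dim elements per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

exaone = kv_bytes_per_token(n_layers=30, n_kv_heads=8, head_dim=80)
qwen = kv_bytes_per_token(n_layers=36, n_kv_heads=2, head_dim=128)

print(f"EXAONE 3.5 2.4B: {exaone} B/token")  # 76800
print(f"Qwen 2.5 3B:     {qwen} B/token")    # 36864
print(f"ratio: {exaone / qwen:.2f}")         # ~2.08
```

If those config values are right, the roughly 2x per-token KV footprint lines up with the ~50% context ceiling reported above.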
Thank you.
Thanks!