High memory use
EXAONE uses a lot more memory for context than Qwen 2.5. Is this inherent to the model, or is it something wrong in llama.cpp?
Hi, electroglyph.
Could you give us more information (e.g., GGUF type and llama-cli parameters) for testing?
When we compared EXAONE-3.5-2.4B-Instruct-BF16.gguf and qwen2.5-3b-instruct-fp16.gguf with the same parameters (llama-cli -cnv -m '...' -p '...') on CPU, EXAONE used less memory.
I've tested with some of the GPU backends (e.g., SYCL and Vulkan), and my context limit is around 50% of what it is with Qwen 2.5 3B. I've tried several versions of llama.cpp so far. I'm going to do some more testing and will be back with more detailed information.
...my context limit is somewhere around 60K with EXAONE 2.4B, but I can hit 120K with Qwen 2.5 3B (no quantization). These small models are great for running in parallel, so my actual per-task context is the total divided by how many parallel tasks I'm running. The lower context limit means I have to run fewer tasks in parallel.
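As a minimal sketch of that tradeoff, assuming (as with llama-server's --parallel option) the total context is split evenly across slots, and using the rough ceilings above:

```python
# Sketch: how many parallel tasks fit under a fixed total-context ceiling,
# assuming the context is split evenly across slots. The 60K / 120K
# figures are the approximate ceilings observed above.
def slots_at_context(total_ctx: int, ctx_per_task: int) -> int:
    """How many parallel tasks fit if each one needs ctx_per_task tokens."""
    return total_ctx // ctx_per_task

for name, total in [("EXAONE 3.5 2.4B", 60_000), ("Qwen 2.5 3B", 120_000)]:
    print(name, slots_at_context(total, ctx_per_task=8_192))  # 7 vs. 14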
After more testing, I can update this to say the context limit is almost exactly 50% of Qwen 2.5 3B's.
I've opened an issue here if you want to weigh in:
https://github.com/ggerganov/llama.cpp/issues/10823
Hi, 0xDEADFED5.
It is due to architectural differences between EXAONE 3.5 2.4B and Qwen 2.5 3B. To be specific, num_attention_heads and num_key_value_heads differ between the two models, and the KV cache grows with the number of key/value heads, so EXAONE needs more memory per token of context.
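For a rough sense of scale, here is a minimal sketch of the f16 KV-cache size per token; the layer, KV-head, and head-dimension counts below are taken from the models' published configs and should be treated as assumptions rather than values verified against the GGUFs:

```python
# Hedged sketch: approximate f16 KV-cache bytes per token of context.
# Config numbers below are assumptions from the models' published configs.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # K and V each store n_kv_heads * head_dim elements per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

exaone = kv_bytes_per_token(n_layers=30, n_kv_heads=8, head_dim=80)
qwen = kv_bytes_per_token(n_layers=36, n_kv_heads=2, head_dim=128)

print(f"EXAONE 3.5 2.4B: {exaone} B/token")  # 76800
print(f"Qwen 2.5 3B:     {qwen} B/token")    # 36864
print(f"ratio: {exaone / qwen:.2f}")         # ~2.08
```

If those config values are right, the roughly 2x per-token KV footprint lines up with the ~50% context ceiling reported above.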
Thank you.
Thanks!