How to Reduce Memory Use in Reasoning Models
🔳 We explore how combining LightThinker and Multi-Head Latent Attention cuts memory use and boosts performance
AI models have shifted from thinking quickly (giving fast answers) to thinking more carefully by breaking problems into smaller steps. o1-like thinking, built on the Chain-of-Thought method, allows large reasoning models such as OpenAI’s o1, o3, and DeepSeek-R1 to backtrack, retry, and refine their reasoning, making them even better at solving tricky problems. We discussed all the important aspects and advantages of scaling test-time compute in one of our previous episodes. However, there is a big issue: this kind of reasoning creates a lot of text (tokens), which takes up memory and slows things down, making processing more expensive. This is especially noticeable with Transformers – the more text they generate, the more memory and computing power they need. As large reasoning models become more prevalent, we must find ways to mitigate their weaknesses while fully exploring their potential for improvement.
Today we will focus on the problem of increased memory use and the longer processing time that comes with it. If we can address memory inefficiency, models can become more balanced and effective while maintaining their high accuracy. Two notable approaches have already been proposed to reduce memory usage in reasoning models: 1) LightThinker, which helps models learn to summarize their own “thoughts” and solve tasks based on these short, meaningful summaries; and 2) Multi-Head Latent Attention (MLA), a DeepSeek solution proposed back when DeepSeek-V2 was released and later implemented in DeepSeek-V3 and DeepSeek-R1.
Today we invite you to dive into these concepts with us and consider the potential benefits of blending them together.
📨 Click follow! If you want to receive our articles straight to your inbox, please subscribe here
In today’s episode, we will cover:
- What is LightThinker?
- What is Multi-Head Latent Attention (MLA)?
- What if we blend LightThinker and MLA concepts?
- Conclusion
- Bonus: Resources to dive deeper
What is LightThinker?
The idea behind LightThinker
As we have already said, we need optimization methods that make high-quality reasoning models much faster and more efficient while avoiding high memory costs.
One of these methods is LightThinker, developed by the Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph. LightThinker doesn’t just cut out words or trim memory manually; it teaches the model to “summarize” its own “thoughts” while solving problems. Think of it like how people jot down key points instead of writing every detail. Let’s look at how it works in detail.
Image Credit: The original LightThinker paper
How does LightThinker work?
In general, instead of keeping long, detailed reasoning steps, LightThinker compresses them into shorter, essential summaries and then continues reasoning based on them.
What’s important is that LightThinker does two things:
- Decides when to compress reasoning steps.
- Decides how to compress them.
Here are the techniques that are used to do these actions.
- When to compress? There are two ways to decide when to summarize the model’s “thoughts”:
- Token-level compression: The model compresses “thoughts” after a fixed number of words. This is simple but might cut off thoughts awkwardly.
- Thought-level compression: The model compresses “thoughts” after completing a full idea, like a sentence or paragraph. This keeps thoughts more organized but requires extra processing to decide when a thought is complete. The LightThinker researchers prefer this method because it preserves meaning better (a short sketch after this list contrasts the two splitting strategies).
- How to compress? There are also two main ways to summarize information:
- Text compression: The model replaces long thoughts with shorter summaries. However, this requires extra processing with an additional encoding model and slows things down.
- Hidden state compression: Instead of rewriting text, the model stores key details in special tokens. These tokens act like mental notes that the AI model can use later. Researchers choose hidden state compression because it is faster and doesn’t require extra models.
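To make the contrast concrete, here is a tiny Python sketch of the two splitting strategies on a toy reasoning trace. The splitting rules (whitespace tokens, sentence boundaries) and the example text are illustrative assumptions of ours, not the paper’s actual implementation.

```python
import re

# A toy reasoning trace; any real trace would come from the model itself.
reasoning_trace = (
    "First, compute 12 * 7 = 84. "
    "Next, subtract 4 to get 80. "
    "Finally, check that 80 + 4 = 84, so the answer is 80."
)

def token_level_segments(text, max_tokens=8):
    """Token-level: cut after a fixed number of (whitespace) tokens,
    even if that lands in the middle of an idea."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

def thought_level_segments(text):
    """Thought-level: cut at sentence boundaries so each segment is a
    complete idea -- the variant the LightThinker authors prefer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(token_level_segments(reasoning_trace))    # may split mid-thought
print(thought_level_segments(reasoning_trace))  # one segment per complete thought
```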
Here is how LightThinker implements compression step by step:
- Before summarizing, LightThinker breaks down the text into smaller sections. It inserts special tokens between sections to mark summaries:
- Compression-trigger token (optional) → A marker telling the model to compress the previous thought.
- C (cache tokens) → The summary tokens of the previous thought that store its key points.
- [o] (output token) → A marker saying "Use this summary to continue reasoning." It helps the model generate new content based on what has been compressed.
So, the sequence for three text sections may look like this: Section 1 → compression trigger → Summary (C) → [o] → Section 2 → compression trigger → Summary (C) → [o] → Section 3. This stage is called “Data Reconstruction”.
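As a rough illustration, here is what that data reconstruction could look like in code. The compression-trigger string and the cache-token names below are placeholders we invented; only the output token [o] and the idea of cache tokens C come directly from the description above.

```python
# Placeholder special tokens -- the paper defines its own token vocabulary.
COMPRESS_TRIGGER = "<compress>"          # optional marker: "compress the previous thought"
CACHE_TOKENS = ["[c1]", "[c2]", "[c3]"]  # cache tokens C that hold the compressed gist
OUTPUT_TOKEN = "[o]"                     # marker: "continue reasoning from the summary"

def reconstruct(question, thoughts):
    """Interleave reasoning sections with compression, cache, and output tokens."""
    sequence = [question]
    for i, thought in enumerate(thoughts):
        sequence.append(thought)
        if i < len(thoughts) - 1:              # the final section needs no compression
            sequence.append(COMPRESS_TRIGGER)  # ask the model to compress what it just wrote
            sequence.extend(CACHE_TOKENS)      # slots where the compressed summary is stored
            sequence.append(OUTPUT_TOKEN)      # resume generation from the compressed state
    return sequence

print(reconstruct("Question: ...", ["Section 1 ...", "Section 2 ...", "Section 3 ..."]))
```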
- Learning to compress and use the summaries: Once the data is structured, the model learns when and how to compress information. A thought-based attention mask controls what the model can and cannot “look at” during each step:
- During compression, the model can process only the original input, the previously compressed content (C), and the current thought.
- When generating the output token, the model can only “look at” the input question and the previously compressed content. This ensures the model reasons over summaries instead of raw data (a simplified sketch of such a mask appears after this walkthrough). Here is the illustration of LightThinker’s attention mask during three-step reasoning:
Image Credit: The original LightThinker paper
- Finally, the model is trained to predict the next tokens using only the summaries (C). It is not allowed to “cheat” by looking at the full text and must learn how to store information efficiently and retrieve it step by step.
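To make the masking rules above more tangible, here is a simplified sketch of how a thought-based attention mask could be built. The segment labels, sequence layout, and visibility rules are our own approximation of the behavior described in this walkthrough, not the authors’ exact implementation.

```python
import torch

# Each position in the reconstructed sequence gets a segment label:
# the input question, then alternating thought / cache / output segments.
labels = (["input"] * 4
          + ["thought1"] * 5 + ["cache1"] * 2 + ["out1"]
          + ["thought2"] * 5 + ["cache2"] * 2 + ["out2"])

def visible(query_label, key_label):
    """Rough visibility rule in the spirit of LightThinker's thought-based mask."""
    if query_label == "input" or query_label.startswith("thought"):
        # New thoughts see the input and earlier cache tokens, not earlier raw thoughts.
        return (key_label == "input"
                or key_label.startswith("cache")
                or key_label == query_label)
    if query_label.startswith("cache"):
        idx = query_label[len("cache"):]
        # Cache tokens see the input, compressed content, and the thought they compress.
        return (key_label == "input"
                or key_label.startswith("cache")
                or key_label == f"thought{idx}")
    if query_label.startswith("out"):
        # Output tokens see only the input and the compressed content.
        return key_label == "input" or key_label.startswith("cache")
    return False

n = len(labels)
mask = torch.zeros(n, n, dtype=torch.bool)
for q in range(n):
    for k in range(q + 1):                 # causal: no attending to future positions
        mask[q, k] = visible(labels[q], labels[k])

print(mask.int())                          # 1 = position q may attend to position k
```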
Overall, LightThinker compresses long thoughts into short memory tokens and continues reasoning with only the important details. An important feature is that LightThinker is designed to separate memory compression from reasoning for better accuracy and efficiency.
Let’s look at the actual results of LightThinker to explore why it’s a better approach.
Actual performance of LightThinker
LightThinker makes AI models more practical and efficient for real-world tasks. The results of its performance are really impressive:
- Memory reduction: LightThinker reduces peak token usage by 70%, meaning it stores way less unnecessary information.
- Faster processing: It speeds up inference time by 26% on Qwen2.5-7B and 41% on Llama3.1-8B (it was tested on these two models). Inference time reduction also depends on the length of response. For example, LightThinker reduces inference time by 44% when generating long responses (32K tokens). For shorter texts (1K–4K tokens), it still saves 1%–4% of inference time.
- Minimal accuracy loss: Accuracy drops only 1% on Qwen and 6% on Llama, which is a reasonable tradeoff for the efficiency gains.
- Generating fewer tokens: LightThinker reduces the number of tokens the model generates by 15% on Qwen and 13% on Llama.
- For simple tasks like math problems, it compresses more aggressively, while for complex tasks like GPQA (Graduate-Level Google-Proof Q&A Benchmark), it compresses more carefully to keep important information.
- Thanks to LightThinker’s design that separates the compression and reasoning steps, accuracy improves by 2%, and when combined with its attention mask strategy, performance is further boosted by 7%.
- The right cache size matters: A small cache size (the number of tokens LightThinker stores in memory) means compression happens more frequently, while a large cache size compresses less often but keeps more information. That’s why increasing the cache size improves accuracy and reduces inference time.
- However, LightThinker struggles with math: Math problems can sometimes be tricky for it because numerical values may be compressed incorrectly.
Image Credit: The original LightThinker paper
As LightThinker is a summarization method, let's briefly outline its benefits and limitations so we can retain these meaningful "notes" for the conclusion of the article. :)
Why is LightThinker a better approach?
- It separates thinking and summarizing for better efficiency.
- Keeps better track of past thoughts.
- Saves memory while staying accurate.
- Runs up to 44% faster on long generations.
Not without limitations
- Struggles with math tasks.
- It is less effective for Llama models, where the accuracy drop is larger.
- Its dynamic compression process can sometimes cause sudden peaks in memory usage.
- LightThinker uses a fixed number of cache tokens for training, but it is uncertain if it can adapt to varying token needs in real-world tasks.
- It is unclear if training LightThinker on much larger datasets would make it even better.
- LightThinker was trained using full-parameter fine-tuning, which is resource-heavy, while more efficient tuning methods like LoRA and QLoRA, which use fewer parameters, were not tested.
These limitations leave room for LightThinker to improve and become more flexible. However, it has already demonstrated beneficial memory-use reduction and much faster inference.
Now, it’s time to move on to DeepSeek’s advancement, which also effectively reduces memory use but works in a completely different way.
What is Multi-Head Latent Attention (MLA)?
Why is MLA needed?
The highly notable DeepSeek-R1 reasoning model incorporates two technical innovations that were proposed back when DeepSeek-V2 was released (we briefly covered them in our AI101 episode about DeepSeek). They include:
- A special Mixture-of-Experts system that consists of multiple expert sub-models (each expert is further divided into smaller, specialized parts) plus shared experts that always stay active, handling common knowledge.
- A specialized attention mechanism called Multi-Head Latent Attention (MLA).
Image Credit: The original DeepSeek-V2 paper
Why did DeepSeek decide to modify the traditional Multi-Head Attention (MHA) mechanism for its best AI models?
In Transformers, the MHA mechanism helps the model focus on the most relevant parts of the input to process and generate text. However, MHA stores a lot of key-value (KV) pairs during inference, which takes up a huge amount of memory and slows the model down.
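To get a feel for the scale of the problem, here is a back-of-the-envelope calculation of the KV cache for standard MHA during a long reasoning trace. The model dimensions are illustrative assumptions, not any particular model’s configuration.

```python
# Illustrative, made-up dimensions for a large Transformer.
n_layers  = 60
n_heads   = 64
head_dim  = 128
seq_len   = 32_000   # a long chain-of-thought
bytes_per = 2        # fp16 / bf16

# Every token caches one key and one value vector per head, in every layer.
kv_bytes = n_layers * n_heads * head_dim * 2 * seq_len * bytes_per
print(f"KV cache for a single sequence: {kv_bytes / 1e9:.1f} GB")  # ~62.9 GB
```

With numbers like these, the KV cache alone can dominate inference memory for a single long request, and that is exactly what MLA targets.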
DeepSeek’s Multi-Head Latent Attention (MLA) is a modified attention mechanism that compresses the KV cache into a much smaller form. MLA is like a smart storage system that compresses past information efficiently while keeping it accessible for future use. It does this through a technique called low-rank key-value joint compression, which allows the model to process information faster and with less memory without losing accuracy. How exactly does MLA compress the KV cache?
How does MLA work?
MLA reduces KV storage while maintaining strong performance through a technique called low-rank key-value joint compression:
Image Credit: The original DeepSeek-V2 paper
- Compression of key-value (KV) pairs: Instead of storing all full-sized KV pairs separately for each token, MLA compresses them into a smaller, lower-dimensional representation before saving them. In other words, MLA projects them into a smaller latent space using a mathematical transformation.
- Decompression for attention computation: When the model needs to use the stored KV pairs, it reconstructs them by projecting the compressed data back up to its original size. The compressed KV data is re-expanded during inference, so the model works as if it still had the full-size KV pairs. This is useful because the model still gets the important information from previous tokens, but the memory footprint stays small, making inference faster (a minimal sketch of this compress/decompress round trip follows this list).
- Decoupled Rotary Position Embedding (RoPE): RoPE modifies K and Q to include positional data, so the model can also keep track of word order. However, when the KV pairs are compressed, the position information gets tangled up in the compressed form, making it hard to reconstruct correctly. MLA solves this problem too: it separates, or decouples, the position information from the main KV compression by using additional multi-head queries and a shared key to handle the positional information separately. This approach provides the following benefits:
- It prevents the position encoding from interfering with the KV compression process.
- The model does not need to recompute all past keys for position adjustments.
- The model can quickly retrieve past token information without extra processing.
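Below is a minimal PyTorch sketch of the core compress/decompress round trip (low-rank key-value joint compression). The module, dimensions, and method names are illustrative choices of ours; the real DeepSeek implementation also handles the query path and the decoupled RoPE keys described above.

```python
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    """Minimal sketch of MLA-style low-rank key-value joint compression."""

    def __init__(self, d_model=1024, n_heads=8, head_dim=64, d_latent=128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        # Down-projection: compress each token's hidden state into a small latent vector.
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: rebuild full-size keys and values from the latent when needed.
        self.up_k = nn.Linear(d_latent, n_heads * head_dim, bias=False)
        self.up_v = nn.Linear(d_latent, n_heads * head_dim, bias=False)

    def compress(self, hidden_states):
        # Only this latent needs to live in the KV cache.
        return self.down_kv(hidden_states)                    # (batch, seq, d_latent)

    def decompress(self, latent_kv):
        b, s, _ = latent_kv.shape
        k = self.up_k(latent_kv).view(b, s, self.n_heads, self.head_dim)
        v = self.up_v(latent_kv).view(b, s, self.n_heads, self.head_dim)
        return k, v                                           # full-size K and V for attention

mla = LowRankKVCompression()
hidden = torch.randn(1, 16, 1024)        # hidden states of 16 cached tokens
latent = mla.compress(hidden)            # cache this: 128 numbers per token
k, v = mla.decompress(latent)            # rebuild 8 heads x 64 dims at attention time
print(latent.shape, k.shape, v.shape)
```

In this toy configuration the cache shrinks from 1,024 numbers per token (keys plus values across 8 heads of 64 dimensions) to a 128-dimensional latent, an 8× reduction, and full-size K and V are rebuilt only when attention is actually computed.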
What are the key benefits of MLA?
- It drastically reduces memory usage due to low-rank compression for keys and values.
- Speeds up text generation: Less KV storage means faster retrieval and inference.
- Performs as well as traditional MHA: Unlike other methods like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), MLA does not sacrifice quality.
- Handles position encoding: Decoupled RoPE ensures that word order is preserved.
However, as usual, we need to discuss the limitations of MLA, as no technique is ideal.
Not without limitations
- Loss of some information due to compression: MLA may slightly weaken long-range dependencies.
- Extra computation for compression and decompression: This can slow down training and slightly affect inference.
- RoPE compatibility issues: MLA requires decoupling RoPE, adding implementation complexity.
- Compression ratio trade-offs: It needs careful tuning to balance memory savings and performance.
- Limited benchmarking on different models: MLA wasn’t tested enough across various AI architectures.
- Challenges integrating with other attention mechanisms: Needs extra tuning for models using customized attention methods.
Despite this, MLA is the part of DeepSeek’s best AI models that ensures efficient inference by substantially reducing the key-value (KV) cache size through compression into a latent vector. We have all seen the amazing results of DeepSeek-R1's performance, but let's review them for a complete picture, adding some more details from previous DeepSeek models.
MLA results
Here we can see how MLA contributed to improving DeepSeek-V2.
MLA compressed the key-value (KV) cache into a latent vector, reducing memory requirements by 93.3%. Since less data needs to be stored and accessed, each inference step becomes faster. By lowering KV memory usage, more sequences can be processed at once, leading to a 5.76× improvement in generation throughput compared to a model without MLA. Of course, both of DeepSeek’s innovations, MLA and the DeepSeekMoE architecture, contributed to these improvements, but the memory-compression strategy comes from MLA.
Image Credit: The original DeepSeek-V2 paper
The phenomenal results of DeepSeek-R1 demonstrate that slow, step-by-step reasoning does not suffer from compression techniques like MLA. On the contrary, the model achieved reasoning performance on par with advanced OpenAI models, like o1-1217.
Image Credit: The original DeepSeek-R1 paper
❓This raises a question❓
Is it possible to further reduce the memory use of models like DeepSeek-R1 while maintaining their accuracy and improving their speed and efficiency?
What if we blend LightThinker and MLA concepts?
Both LightThinker and MLA aim to improve LLM efficiency by cutting memory use and achieving faster inference. But they target different aspects of the model's operation. While LightThinker is designed for reasoning compression, MLA optimizes attention memory usage by compressing key-value (KV) caches to reduce storage and speed up inference.
To put it simply, MLA is like zipping and unzipping stored data to save memory, while LightThinker is like summarizing a long conversation so the AI model doesn’t have to remember every detail.
Now imagine if models like DeepSeek-R1 incorporated a technique like LightThinker together with MLA. This combination could lead to more efficient and powerful reasoning models. Here are some suggestions for how this combination could look:
- Firstly, LightThinker compresses reasoning steps, storing only key information in summaries while discarding redundancies.
- Then MLA could ensure selective attention to these compressed summaries, prioritizing the most relevant latent details instead of treating all tokens equally.
This could allow MLA to act as an adaptive retrieval mechanism, improving recall of LightThinker’s compressed “thoughts”. MLA could also refine how LightThinker’s compressed steps are retrieved and expanded, striking a balance between compactness and depth of reasoning. This combination of the two concepts could thus speed up processing by selectively weighting attention across latent dimensions that have already been compressed.
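Purely as a thought experiment, here is a hypothetical sketch of what such a hybrid memory could look like: LightThinker-style cache-token states are collapsed into one summary per thought, stored through an MLA-style latent projection, and expanded only when a query attends to them. Every class name, dimension, and rule here is invented for illustration; neither paper describes this combination.

```python
import torch
import torch.nn as nn

class HybridThoughtMemory(nn.Module):
    """Hypothetical blend: LightThinker-style thought summaries stored
    through an MLA-style latent ("zip") projection."""

    def __init__(self, d_model=1024, d_latent=128):
        super().__init__()
        self.to_latent = nn.Linear(d_model, d_latent, bias=False)    # "zip"
        self.from_latent = nn.Linear(d_latent, d_model, bias=False)  # "unzip"
        self.latent_store = []   # one latent entry per compressed thought

    def compress_thought(self, cache_token_states):
        # cache_token_states: hidden states of LightThinker's cache tokens (C)
        # for one finished thought, shape (num_cache_tokens, d_model).
        summary = cache_token_states.mean(dim=0)           # collapse the thought to one vector
        self.latent_store.append(self.to_latent(summary))  # keep only the small latent

    def retrieve(self, query_state):
        # Weight the stored latents by relevance to the current query,
        # then expand only the blended result back to model size.
        latents = torch.stack(self.latent_store)                     # (n_thoughts, d_latent)
        scores = torch.softmax(latents @ self.to_latent(query_state), dim=0)
        return self.from_latent((scores.unsqueeze(-1) * latents).sum(dim=0))

memory = HybridThoughtMemory()
for _ in range(3):                                 # three compressed thoughts
    memory.compress_thought(torch.randn(2, 1024))  # two cache tokens per thought
context = memory.retrieve(torch.randn(1024))       # what the model attends to next
print(context.shape)                               # torch.Size([1024])
```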
Moreover, LightThinker and MLA could mitigate each other’s limitations. For example:
- MLA could help LightThinker recover key details that are lost during aggressive summarization, or even prevent crucial details from being lost at all.
- As LightThinker struggles with math tasks because numerical values may be compressed incorrectly, MLA can reinforce numerical consistency in compressed thoughts by prioritizing structured retrieval.
- MLA’s ability to selectively attend to LightThinker’s compressed cache could make its compression strategy more flexible in real-world tasks.
- MLA might slightly weaken long-range dependencies; in this case, LightThinker’s step-by-step compression of thoughts reduces the amount of information MLA needs to handle in the first place.
Overall, it is interesting to explore how MLA could distribute attention over LightThinker’s “notes” (compressed “thoughts”), zipping and unzipping them only when needed to save memory. Together, they could improve inference efficiency, maintain reasoning depth, and further optimize memory use.
We encourage you to try.
Conclusion
Memory compression is an essential optimization technique, especially for large reasoning models that require significant resources for extended reasoning processes. LightThinker and MLA effectively reduce memory usage and speed up inference. However, as AI models continue to expand their reasoning steps and inference time to achieve greater accuracy – and as the community pushes beyond the current achievements of models like o1 and DeepSeek-R1 – we cannot afford to pause in developing more effective compression techniques, as our resources are valuable.
That’s why hybrid approaches like MLA + LightThinker might be the key to more efficient memory use and faster inference. Developers and researchers, what do you think about this?
Author: Alyona Vert
Author of the idea and editor: Ksenia Se
Bonus: Resources to dive deeper
- LightThinker: Thinking Step-by-Step Compression by Jintian Zhang, Shuofei Qiao, Huajun Chen, Ningyu Zhang et al.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by Bo Liu, Damai Dai et al.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by Daya Guo, Qihao Zhu et al.
- Model Compression and Efficient Inference for Large Language Models: A Survey
- Efficient Transformers: A Survey
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
- FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework
- Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning by Orion Weller
Turing Post resources
📨 If you want to receive our articles straight to your inbox, please subscribe here