Topic 33: Slim Attention, KArAt, XAttention and Multi-Token Attention Explained – What’s Really Changing in Transformers?
🔳 We explore four advanced attention mechanisms which improve how models handle long sequences, cut memory use and make attention learnable
Attention in AI is a fundamental technique which will always remain a hot topic as we continue to work with architectures like transformers. Attention mechanisms give us a peek into what the model is focusing on when making decisions. They allow models to dynamically focus on specific parts of their input, and researchers are trying to use attention weights for interpretability and for figuring out why a model made a choice.
Two types of attention, both core mechanisms of transformers, once revolutionized the effectiveness of AI models: 1) Self-Attention, which lets each token “look” at all others in a sequence to understand context, and 2) Multi-Head Attention (MHA), which runs multiple attention mechanisms in parallel to capture different types of relationships. They are now foundational to all major current LLMs, such as GPT, BERT, T5, and LLaMA.
Another notable attention mechanism is DeepSeek’s Multi-Head Latent Attention (MLA), which we covered in one of our previous episodes. It goes further than MHA and allows us to reduce memory use by modifying the MHA mechanism to compress the KV cache into a much smaller form.
Even these examples show that new attention techniques = new possibilities and capabilities. They also open up possibilities for steering or guiding generation. Recently, we have observed that researchers are increasingly focusing on attention in AI. This gives us a hint that the community is seeking new mechanisms to take the models we use daily to the next level. Today we are going to dive into four of the latest, quite different attention mechanisms: 1) Slim Attention, which processes long context faster and cuts memory use by up to 32 times; 2) XAttention, which improves the effectiveness of sparse attention on long sequences, including text and video; 3) Kolmogorov-Arnold Attention (KArAt and Fourier-KArAt), a completely different approach that focuses on making attention learnable and adaptable; plus 4) Multi-Token Attention, which lets words "team up" to decide what’s important. How do they work? What models can benefit from them? Did we get your attention? Good. Let’s begin. You’ll learn a lot!
📨 Click follow! If you want to receive our articles straight to your inbox, please subscribe here
What is Slim Attention?
Working with long context remains a serious challenge for all LLMs: it takes up a lot of memory and slows everything down – especially when generating new tokens during inference. Researchers from OpenMachine decided to overcome this issue by focusing on the attention mechanism and proposed Slim Attention. This technique gets the same results as MHA, but faster and with less memory, which is pretty cool for scaling up large models. For example, in models like Whisper, Slim Attention can reduce memory use by 8 times and make text generation up to 5 times faster when running big batches. In some cases, like with the T5-11B model, it can even cut memory use by 32 times!
Let’s look more closely at how Slim Attention achieves these impressive results →
How does Slim Attention work?
The key lies in using the same math as in MHA, but with more efficient memory use. When models process long chunks of text, they need to remember a lot of stuff – specifically the K (keys) and V (values) — in what's called the KV cache, which can get huge.
Normally, when transformers process text, they store both keys (K) and values (V) in memory. But Slim Attention found a clever trick.
Since K and V come from the same input and are produced by square projection matrices, Slim Attention can reverse the math and rebuild V from K instead of storing both – and get exactly the same result. There are two options to rebuild V from K:
- Recalculate V from K, and then do the attention math. It’s simple but requires more compute.
- Do the attention math first, then apply the V-from-K transformation. It’s more efficient and faster, especially when generating one token at a time, but slightly more complex to implement.
Option 2 is usually better during generation, because it saves compute while still using less memory.
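To make this concrete, here is a minimal sketch (in PyTorch, with made-up dimensions and weight names) of the reconstruction trick, assuming the key and value projections are square and the key projection is invertible:

```python
import torch

d = 64                                  # model dimension (square projections assumed)
W_k = torch.randn(d, d)                 # key projection weights (assumed invertible)
W_v = torch.randn(d, d)                 # value projection weights

x = torch.randn(10, d)                  # hidden states of 10 cached tokens
K = x @ W_k                             # what a standard KV cache stores as keys
V = x @ W_v                             # ...and what it also stores as values

# Slim Attention's trick: since x = K @ W_k^-1, V can be recovered from K alone.
W_k_inv_W_v = torch.linalg.inv(W_k) @ W_v    # precomputed once per layer
V_rebuilt = K @ W_k_inv_W_v

print((V - V_rebuilt).abs().max())      # ~0: V is recovered up to float error
```

In option 2 from the list above, the same precomputed matrix is applied after the attention-weighted sum over K instead, so far fewer vectors need to be transformed during token-by-token generation.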
Image Credit: Slim Attention original paper
So instead of storing both K and V, Slim Attention just stores K, and recalculates V when needed. This cuts memory use in half or lets the model double how much text it can “see” at once.
Researchers also solved a potential issue with applying RoPE (rotary positional embeddings), which helps the model understand where each token is placed in the sequence. The thing is that RoPE is only applied to Q (queries) and K (keys) — not to V (values). That’s why to rebuild V from K with Slim Attention, you may need to undo RoPE first. There are two options to do this depending on whether we're using sparsity tricks:
- If we don’t use sparsity, we store raw K in the cache and apply RoPE later, after reading K from memory. Here raw K is used to calculate V.
- If we use sparsity, we store RoPE’d K, but undo it for the few K-vectors that are actually used. This is called RoPE decoding, and luckily, it is simple and uses the same sine/cosine math as encoding (see the sketch below).
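Because RoPE is just a rotation by position-dependent angles, “decoding” it means rotating by the negative angle with the same sine/cosine tables. Here is a minimal sketch (function names and shapes are illustrative, not taken from the paper):

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE frequencies, one per pair of channels.
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)
    return positions[:, None].float() * freqs[None, :]      # (seq, dim/2)

def rotate(k, angles, sign=1.0):
    # sign=+1 applies RoPE (encoding), sign=-1 undoes it (decoding).
    cos, sin = torch.cos(sign * angles), torch.sin(sign * angles)
    k1, k2 = k[..., 0::2], k[..., 1::2]
    out = torch.empty_like(k)
    out[..., 0::2] = k1 * cos - k2 * sin
    out[..., 1::2] = k1 * sin + k2 * cos
    return out

k_raw = torch.randn(5, 64)                  # 5 cached key vectors
angles = rope_angles(torch.arange(5), 64)
k_roped = rotate(k_raw, angles, +1.0)       # what the sparse variant stores
k_back = rotate(k_roped, angles, -1.0)      # RoPE decoding before rebuilding V
print(torch.allclose(k_raw, k_back, atol=1e-5))   # True
```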
So what does Slim Attention achieve thanks to its smart and simple approach to KV?
Advantages of Slim Attention
- It makes models faster and reduces memory needs in general by half without hurting accuracy – it is especially helpful for long inputs and large batch sizes.
- For one input (batch size 1), it can make things up to 1.8x faster.
- For many inputs (batch size 16), the speedup can hit 2x.
- And, as we have already mentioned, in models like Whisper, it can reduce memory use by 8x and make it up to 5x faster. With T5-11B, it can cut memory use by 32x.
- It is cheaper to run.
- Slim Attention is easy to add to existing models that use multi-head attention (MHA), such as CodeLlama, Aya, Phi-3-mini, SmolLM2, vision-language models like LLaVA, audio-text models like Qwen2-Audio, and encoder-decoder models like Whisper and T5. No retraining is needed.
- Slim Attention can easily support biases if they are used in models.
Despite these impressive advantages, Slim Attention also has some weaknesses that might not be obvious.
Not without limitations
- While saving memory, Slim Attention spends a bit more compute to reconstruct V on the fly. On systems where compute is the bottleneck rather than memory, this might slow things down instead of helping.
- It doesn’t work with some newer variations of attention like MQA (multi-query attention) or GQA (grouped query attention), which many newer LLMs use for efficiency.
- RoPE handling adds complexity, especially if using sparsity tricks.
- It gives less benefit for small models and inputs, as memory use is already low.
- Slim Attention is not ideal for non-square weight matrices – that is, when models (like T5-11B) have projection weights where the output size is larger than the input size. Inverting these larger matrices is harder and more expensive.
- Doesn’t work with all positional encoding methods.
In summary, Slim Attention is a promising drop-in replacement for MHA in cases where memory use is the main bottleneck and we need faster token processing.
Handling long contexts and diverse data – like video, images, and audio – is a key challenge for today’s LLMs. New attention mechanisms are emerging to tackle it. The next one we explore takes a different approach.
What is XAttention?
The main idea of XAttention
One way to speed up the processing of long inputs of any type in AI models is called block-sparse attention. It focuses only on the most important parts of the input, instead of checking every possible connection. But figuring out which parts are important usually takes a lot of time and effort too, which cancels out the speed benefit.
That’s where XAttention comes in to make block-sparse attention much more efficient. It was introduced by researchers from Tsinghua University, the Massachusetts Institute of Technology, SJTU, and NVIDIA. Instead of doing heavy calculations to find the important parts, it “looks” at something simple: the sum of values along antidiagonal lines in the attention matrix (running from bottom-left to top-right). This turns out to be a great shortcut for spotting what matters most.
By quickly estimating which parts are important, XAttention runs much faster – in some cases up to 13.5 times faster – while keeping the same accuracy as regular attention. Let’s look at exactly how it works, step by step.
How does XAttention work?
- XAttention starts with predicting importance. Its key idea is to focus on antidiagonals – the lines running from bottom-left to top-right within each block. By summing values along these lines, the model quickly estimates how important a block is.
Image Credit: XAttention original paper
Take the image above as an example: the input is divided into 8×8 blocks. Red lines show higher summed values (more important), blue lines lower ones (less important). But why does this work? Because every token contributes to at least one antidiagonal. Plus, high-impact patterns – like verticals or slashes – tend to intersect with these lines. This gives XAttention a reliable, fast signal for prioritization.
Image Credit: XAttention original paper
- Choosing which blocks to keep. Once the blocks are scored using antidiagonals:
- Their scores are normalized using softmax, which turns them into probabilities.
- Based on these scores, the model “picks” only the high-scoring blocks and performs attention on these selected blocks (marked in red in the first image above). It’s only the smallest set of blocks whose combined importance scores pass a certain threshold.
- Tuning the threshold for each attention head. Different parts of the model, called attention heads, behave differently – some need more detail, others can be sparser. XAttention uses a dynamic programming strategy to adjust the threshold for each head individually, gradually lowering it by 10% at each step to find the right balance between speed and accuracy. This step helps fine-tune performance even more, but it’s optional – XAttention still works well without it. (A simplified sketch of the scoring and selection procedure is below.)
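To make the scoring and selection steps above more concrete, here is a rough sketch. The block size, stride, and threshold are illustrative assumptions, and unlike this toy version, the real method computes only the strided antidiagonal entries rather than the full score matrix:

```python
import torch

def xattention_block_select(scores, block=8, stride=4, threshold=0.9):
    """Simplified XAttention-style block selection on a (pre-softmax) score matrix."""
    n = scores.shape[0]
    nb = n // block
    blocks = scores.reshape(nb, block, nb, block).permute(0, 2, 1, 3)  # (nb, nb, B, B)

    # Score each block by summing values on every `stride`-th wrapped antidiagonal,
    # so every row and column of the block contributes to the estimate.
    i = torch.arange(block)[:, None]
    j = torch.arange(block)[None, :]
    anti_mask = (i + j) % stride == 0
    block_scores = (blocks * anti_mask).sum(dim=(-2, -1))              # (nb, nb)

    # Normalize with softmax, then keep the smallest set of blocks per query-block
    # row whose combined probability passes the threshold.
    probs = torch.softmax(block_scores, dim=-1)
    sorted_p, order = probs.sort(dim=-1, descending=True)
    keep_sorted = (sorted_p.cumsum(dim=-1) - sorted_p) < threshold
    keep = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, order, keep_sorted)
    return keep                                                        # boolean block mask

mask = xattention_block_select(torch.randn(64, 64))
print(mask.float().mean().item(), "of the blocks kept on average per row")
```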
Now it's important to analyze what this attention mechanism allows models to achieve, along with its main advantages.
Benefits of XAttention’s performance
XAttention demonstrates impressive performance improvements compared to other sparse attention mechanisms, and in some cases, it achieves scores on par with full attention.
- Efficiency and speed: It delivers up to 13.5x faster attention computation on long sequences and also outperforms other methods on shorter inputs thanks to its lighter pattern selection. This greatly reduces memory and computation cost.
- Lightweight design: XAttention’s pattern selection is up to 24.9x faster.
- Improved text understanding: It outperforms other sparse attention methods on long-text tasks, staying accurate even at very long sequence lengths.
- On RULER, it can even beat full attention.
- On LongBench, it has the highest average accuracy across tasks.
Image Credit: XAttention original paper
- Top video understanding: XAttention can handle videos up to 1 hour long with 1 frame per second, beating other sparse attention methods and even FlashAttention.
- Video generation efficiency: It is very close to full attention, with over 50% computation savings.
- Plug-and-play: XAttention can be easily added to existing Transformer models without major changes.
- Scalability: Designed for long-context Transformers (LCTMs), it helps models scale to 256k+ tokens or hour-long videos.
Overall, XAttention makes it much easier to use powerful models in the real world without breaking the bank on compute. Its lightweight design and plug-and-play nature make it practical for real-world AI systems – especially those involving multimodal AI. However, it still has some issues we should take into account.
Limitations to be aware of
- The antidiagonal scoring is a clever shortcut, but it's still an approximation. In some rare cases, it might miss important patterns.
- Trade-off between sparsity and accuracy: Choosing too high a threshold or too large a stride can make the attention too sparse, leading to loss of accuracy.
- Warmup required for video generation: Applying XAttention too early in the video generation process can cause layout issues. This is fixed by adding a "warmup phase" that uses full attention for the first 5 steps. However, this adds complexity to integration.
- Designed for block structures: XAttention structure might not be optimal for irregular or highly dynamic inputs.
- While it's plug-and-play for many models, it's not directly tested with every Transformer variant. Models using non-standard attention formats may require adjustments.
Despite these issues, XAttention proves to be a powerful upgrade to sparse attention mechanisms, with strong performance on long sequences.
Now, it’s time to move on to the third attention type – Kolmogorov-Arnold Attention – and explore what it offers to make the attention mechanism better.
What is Kolmogorov-Arnold Attention (KArAt)?
The idea behind applying KAN in attention mechanism
Researchers from the University of Central Florida took a completely different – and rather philosophical – approach: they aimed to explore how Kolmogorov-Arnold Networks (KANs) could enhance Transformers in vision tasks, not by being more efficient, but by being smarter and more flexible in how they learn. Their main question is: What if attention itself could learn?
Just a quick reminder about KANs: Instead of multiplying inputs by weights, KANs apply learned spline functions to each input dimension. Then, just like MLPs, they sum those transformed values and pass them to the next layer. This helps KANs understand more complex patterns in data. After training, researchers can look inside the model to see what it learned, almost like peeking into its “thought” process.
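As a toy illustration of the difference (not the official KAN implementation), an MLP edge multiplies its input by a single weight, while a KAN-style edge passes the input through a small learnable function. Real KANs use splines; here a few simple basis functions with learnable coefficients stand in for brevity:

```python
import torch
import torch.nn as nn

class KANEdge(nn.Module):
    """One KAN-style edge: a learnable 1D function phi(x) instead of a weight w*x.
    Real KANs parameterize phi with splines; simple bases are used here for brevity."""
    def __init__(self):
        super().__init__()
        self.coeffs = nn.Parameter(torch.zeros(3))   # learnable mix of basis functions

    def forward(self, x):
        basis = torch.stack([x, torch.sin(x), torch.tanh(x)], dim=-1)
        return basis @ self.coeffs

# A tiny KAN "layer": one learnable function per input dimension, then a sum.
edges = nn.ModuleList(KANEdge() for _ in range(3))
x = torch.randn(5, 3)                                # batch of 5, 3 input features
y = torch.stack([edges[d](x[:, d]) for d in range(3)], dim=-1).sum(-1)
print(y.shape)                                       # torch.Size([5])
```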
KANs have shown strong results in tasks like discovering math formulas and learning simple one-dimensional patterns over time. However, there hasn’t been much testing of combining KANs with more powerful architectures like transformers.
To explore whether a learnable, KAN-based attention mechanism could replace the traditional, non-adaptive softmax in Vision Transformers (ViTs) and do a better job, the researchers created a version called Kolmogorov-Arnold Attention (KArAt), and later a more practical, modular version using Fourier features, called Fourier-KArAt. Let’s look at how these mechanisms work.
How does KArAt work?
In a Vision Transformer (ViT), each attention head “looks” at the input and “decides” which parts to pay attention to, building an attention matrix, where each row represents how much attention one part of the input gives to all the others. Each row is run through a softmax function, which turns the rows of raw numbers into probabilities (all the numbers become positive and they add up to 1).
KArAt replaces softmax with custom, learnable activation functions using KANs. As a result, the model learns its own transformation from data.
This approach is inspired by the Kolmogorov-Arnold theorem, which explains how complex functions can be broken down into simpler ones. Here, KArAt builds its functions from basis functions – math building blocks that could be:
- Fourier basis: combinations of sine and cosine waves.
- B-splines: smooth, curve-fitting functions.
- Wavelets or fractals for more specialized cases.
Image Credit: Kolmogorov-Arnold Attention original paper
Each attention row is passed through one of these custom functions, made by mixing different basis functions. The model learns how much of each basis function to use – and these are the parameters it tunes during training.
But the output might not behave like probabilities, so a special ℓ1 projection fixes that by making the values positive and ensuring they sum to 1. This keeps the attention weights meaningful.
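A minimal sketch of that projection step (a simple clamp-and-renormalize stand-in for the paper’s ℓ1 projection):

```python
import torch

def project_to_rows_summing_to_one(rows, eps=1e-9):
    # Clamp negatives to zero and renormalize, so the output of the learnable
    # function can still be used like softmax probabilities.
    rows = rows.clamp(min=0.0)
    return rows / (rows.sum(dim=-1, keepdim=True) + eps)

raw = torch.randn(2, 16, 16)      # stand-in for outputs of a learnable row function
attn = project_to_rows_summing_to_one(raw)
print(attn.sum(-1))               # each row now sums to ~1 (barring all-negative rows)
```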
But here comes the real-world issue: it’s extremely expensive to compute. The solution is to make the architecture smaller and smarter – and that’s Fourier-KArAt.
How does Fourier-KArAt work?
Attention matrices often have a low-rank structure, which means they don’t need all dimensions to represent the important information. So the researchers proposed shrinking the size of the learnable operator:
- Instead of using a full matrix with N × N values, we can use a much smaller r × N matrix (where r is much smaller than N).
- This smaller matrix compresses the attention data into a lower-dimensional space.
- Then, another matrix projects it back to its original size. This trick dramatically reduces memory use.
This new lightweight version is called Fourier-KArAt (Kolmogorov-Arnold Attention with Fourier basis).
Why are Fourier basis functions better than other options? It’s simple. They can represent any smooth periodic function, are GPU-friendly and scale better.
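Putting the pieces together, here is a rough sketch of a Fourier-KArAt-style attention unit. The dimensions, initialization, and the final clamp-and-renormalize step are illustrative assumptions, not the paper’s exact design:

```python
import torch
import torch.nn as nn

class FourierKArAtSketch(nn.Module):
    """Row-wise learnable attention: compress each N-dim attention row to r dims,
    apply a learnable Fourier-series activation, project back, then renormalize."""
    def __init__(self, n_tokens, r=16, num_freqs=4):
        super().__init__()
        self.down = nn.Linear(n_tokens, r, bias=False)        # r x N compression
        self.up = nn.Linear(r, n_tokens, bias=False)          # project back to N
        self.a = nn.Parameter(torch.randn(num_freqs) * 0.01)  # learnable sine coeffs
        self.b = nn.Parameter(torch.randn(num_freqs) * 0.01)  # learnable cosine coeffs

    def fourier_act(self, x):
        k = torch.arange(1, self.a.numel() + 1, device=x.device).float()
        xk = x.unsqueeze(-1) * k
        return (self.a * torch.sin(xk) + self.b * torch.cos(xk)).sum(-1)

    def forward(self, scores):                    # scores: (heads, N, N) raw rows
        rows = self.up(self.fourier_act(self.down(scores)))
        rows = rows.clamp(min=0.0)                # keep rows probability-like
        return rows / rows.sum(-1, keepdim=True).clamp(min=1e-9)

attn = FourierKArAtSketch(n_tokens=197)(torch.randn(3, 197, 197))
print(attn.shape, attn.sum(-1)[0, 0])             # rows sum to ~1
```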
Fourier-KArAt can be used in real ViT models and its learnable attention modules can be configured in two ways:
Image Credit: Kolmogorov-Arnold Attention original paper
- Blockwise: Each layer (block) in the transformer has its own unique attention functions. This captures more detailed patterns but uses more parameters.
- Universal: All layers share the same attention functions, which keeps the model smaller and may work better for simpler tasks.
These two setups trade off flexibility against efficiency, depending on the task. But are there any performance benefits of Fourier-KArAt?
Advantages of Fourier-KArAt
Here are the main scenarios where Fourier-KArAt shows improvement:
- Better accuracy on small models: In ViT-Tiny, Fourier-KArAt boosts accuracy without needing extra tricks:
- +5.4% on CIFAR-10
- +7.4% on CIFAR-100
- Sharper attention maps: Visualizations show that Fourier-KArAt tends to focus more precisely on the important parts of an image, helping the model make clearer decisions.
Let’s also summarize its overall benefits:
- Learnable attention: Fourier-KArAt gives the model more flexibility to adapt how it focuses on input data.
- Smooth approximation with Fourier: Using sine and cosine functions helps the model smoothly approximate softmax-like behavior while remaining trainable and GPU-friendly.
- Reduced computation via low-rank design: Instead of heavy full-sized operations, it uses a lower-dimensional version of the attention mechanism, cutting down memory and compute cost.
- Flexible to different architectures: The design can potentially work with any mathematical basis, making it easy to experiment with different attention behaviors in the future.
However, Fourier-KArAt does not perform as well as we would like it to.
Not without limitations
- No clear benefit in larger models: In ViT-Small and ViT-Base, the accuracy improvements are minimal or inconsistent, sometimes even worse than the vanilla model.
- Fourier-KArAt is more complex to implement than regular attention: Interpreting the attention maps is more difficult, and visualizations may require extra tweaking.
- It adds a lot of extra parameters, which can make the model harder to train and tune, especially in deeper or bigger architectures like ViT-Base.
- Fourier-KArAt can create a less stable training path, especially in larger models.
- While it learns quickly in the beginning, the training slows down in later stages.
- High memory use: Training with Fourier-KArAt can consume a lot of GPU memory, even up to 60 GB for a single image in some cases.
Fourier-KArAt currently has more limitations than benefits, but it presents a fresh and alternative approach to attention with huge potential. Like many other developments, Fourier-KArAt needs more time and experimentation to reach a higher level of quality. This will enable it to fully use its existing benefits and achieve new ones.
Multi-Token Attention (MTA)
Recently, researchers from FAIR at Meta introduced another advanced attention mechanism – Multi-Token Attention (MTA), which is especially effective in long-context tasks. Instead of comparing just one “query” word to one “key” word at a time, MTA lets the model consider small groups of nearby words together. This attention information is also shared between different attention heads. Here is how MTA realizes this concept.
How does MTA work?
MTA borrows the convolution technique from image processing, where nearby pixels influence each other – but in MTA’s case, it’s tokens instead of pixels. MTA applies this technique at two levels:
Image Credit: MTA original paper
- Key-Query convolution:
The model looks at groups of nearby words when deciding what to pay attention to. It’s like a sliding window that allows the model to consider words together rather than separately.
For example, if a model has to answer the question “Where did Alice see the rabbit?”, it needs to find a sentence that mentions both “Alice” and “rabbit”. With key-query convolution, a model can combine the tokens for “Alice” and “rabbit” to figure out where they appear together in a sentence, while standard attention can only find where they appear separately.
There are two ways to apply this process:
- Before the model picks which words are important (before softmax).
- After it picks them (after softmax).
- Head-mixing convolution:
MTA lets attention heads share what they’ve found. For example, one head looking for "Alice" and another looking for "rabbit" can combine forces, so the head group figures out where both appear together. This mixing can be done before or after softmax.
To improve learning and avoid training issues, MTA uses group normalization with depth scaling. Each attention head in MTA applies its own group normalization, instead of sharing one across the whole layer. The scaling factor changes based on the depth of the layer, so, at deeper layers, the model uses stronger normalization to keep things stable. This helps control how much normalization affects learning at each step.
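The sketch below illustrates both convolution levels in a simplified form. The kernel size, head grouping, and pre-/post-softmax placement are assumptions for illustration; the real MTA learns these inside a full transformer layer and handles masking and normalization more carefully:

```python
import torch
import torch.nn.functional as F

heads, seq, group = 8, 32, 2
logits = torch.randn(1, heads, seq, seq)          # raw attention scores (QK^T / sqrt(d))
causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))

# 1) Key-query convolution (pre-softmax variant): each score now depends on a small
#    neighborhood of nearby query and key positions via a 3x3 kernel per head.
#    Future-key positions are zeroed first so the convolution cannot leak them.
kq_kernel = torch.randn(heads, 1, 3, 3) * 0.1
mixed = F.conv2d(logits.masked_fill(~causal, 0.0), kq_kernel, padding=1, groups=heads)
attn = torch.softmax(mixed.masked_fill(~causal, float("-inf")), dim=-1)

# 2) Head-mixing convolution (post-softmax variant): heads in the same group combine
#    each other's attention maps with learnable mixing weights.
w = torch.softmax(torch.randn(heads // group, group, group), dim=-1)
attn = attn.view(1, heads // group, group, seq, seq)
attn = torch.einsum("bgiqk,gji->bgjqk", attn, w).reshape(1, heads, seq, seq)

# (MTA additionally applies per-head group normalization with depth-dependent
#  scaling, omitted here.)
print(attn.shape)                                 # torch.Size([1, 8, 32, 32])
```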
Benefits of MTA
- MTA is better at combining clues to find the right answer.
- More contextual understanding: MTA allows attention to depend on multiple tokens at once, not just one key-query pair, so it can better locate complex or composite information. It improves language understanding without needing extra parameters.
- Combining strength across heads leads to better reasoning.
- Performs well even when there are multiple hidden facts.
- Learns attention patterns automatically.
- Is flexible and can be tuned for different tradeoffs like speed vs. accuracy.
- Improved accuracy in long contexts: MTA performs better on tasks that require precise retrieval from long texts.
- Solves complex attention tasks: In toy tasks where standard transformers fail (like identifying blocks with multiple target letters), MTA succeeds with zero error.
- Small overhead: The performance improvements come with very little increase in parameters (only ~0.001%)—so it’s efficient in terms of model size.
Image Credit: MTA original paper
Not without limitations
Despite many advantages, there are several serious limitations of MTA:
- MTA does not work with popular optimized attention kernels like FlashAttention, so it's slower and less efficient to train in current frameworks.
- Due to convolutions over multiple dimensions (queries, keys, heads), MTA uses significantly more GPU memory than standard attention.
- Lower Throughput: Training with MTA is also slower in terms of tokens per second, mainly because it doesn’t benefit from hardware-level optimizations yet.
- Cost and complexity tradeoff: The convolutions and normalization introduce additional complexity.
Comparison of four attention types
Today, we took a deep dive into four novel approaches to attention mechanisms. Each improves different aspects of attention to boost the performance of Transformer-based models. But which approach is best to use depends on your specific goal:
- If you need to significantly cut memory usage while handling long context – Slim Attention is your best option.
- When you need effective and really fast processing of long sequences, including videos – XAttention, with its simpler pattern selection and up to 13.5x faster attention computation, is a great choice.
- If you want an attention mechanism that can learn, adapt, and be flexible – Fourier-KarAt using KANs is a promising solution.
- When the purpose is to solve complex tasks, especially with long context, and get benefits of collaboration of multiple attention heads – Multi-token attention is a great option too.
Maybe one of these new mechanisms, or even all of them, will help take the efficiency of LLMs’ attention to the next level, just as self-attention once did. Either way, attention mechanisms will continue to evolve alongside AI models.
We also invite you to explore other interesting attention mechanisms featured in our post:
Author: Alyona Vert Editor: Ksenia Se
Sources and further reading
Resources for further reading:
- Slim attention: cut your context memory in half without loss of accuracy – K-cache is all you need for MHA
- Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
- XAttention: Block Sparse Attention with Antidiagonal Scoring
- Multi-Token Attention
- Attention Is All You Need
- Effective Approaches to Attention-based Neural Machine Translation
- Neural Machine Translation by Jointly Learning to Align and Translate
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Turing Post resources
- Topic 6: What is KAN?
- How to Reduce Memory Use in Reasoning Models
- Token 1.6: Transformer and Diffusion-Based Foundation Models
📨 If you want to receive our articles straight to your inbox, please subscribe here