Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
Abstract
Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of learnable activation functions with the potential to capture more complex relationships from data. Although KANs are useful for finding symbolic representations and for the continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the compute and memory costs of training it motivated us to propose a more modular version, and we designed a particular learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available at: https://subhajitmaity.me/KArAt
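To make the idea concrete, the sketch below shows one plausible reading of Fourier-KArAt in PyTorch: the fixed softmax over the query-key scores is replaced by a learnable truncated Fourier series whose coefficients are trained with the rest of the network. The class name, the `grid_size` parameter, and the initialization are illustrative assumptions, not the paper's exact implementation; the released code at the URL above is authoritative.

```python
import torch
import torch.nn as nn


class FourierKArAtAttention(nn.Module):
    """Sketch of a learnable (Fourier-basis) attention unit for a ViT.

    Instead of a fixed softmax, each entry of the query-key score matrix
    is passed through a learnable truncated Fourier series
        phi(x) = sum_k [ a_k * cos(k x) + b_k * sin(k x) ],
    whose per-head coefficients (a_k, b_k) are trained end to end.
    """

    def __init__(self, dim: int, num_heads: int = 8, grid_size: int = 3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One (a_k, b_k) coefficient pair per head and per frequency k;
        # small random init is an assumption, not the paper's scheme.
        self.coef_a = nn.Parameter(0.1 * torch.randn(num_heads, grid_size))
        self.coef_b = nn.Parameter(0.1 * torch.randn(num_heads, grid_size))

    def _learnable_unit(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (B, H, N, N). Apply the Fourier series elementwise.
        freqs = torch.arange(1, self.coef_a.shape[-1] + 1,
                             device=scores.device, dtype=scores.dtype)
        x = scores.unsqueeze(-1) * freqs                  # (B, H, N, N, K)
        a = self.coef_a.view(1, self.num_heads, 1, 1, -1)
        b = self.coef_b.view(1, self.num_heads, 1, 1, -1)
        return (a * torch.cos(x) + b * torch.sin(x)).sum(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # (B, H, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) * self.scale   # (B, H, N, N)
        attn = self._learnable_unit(scores)               # replaces softmax
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```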
Community
Vision Transformers and self-attention have been around for half a decade now. While the interactions between queries and keys capture meaningful relationships between different regions of an image, we delve into finding a better way of representing and interpreting these interactions using learnable attention. Specifically, the ability of Kolmogorov-Arnold Networks to learn and model flexible functions led us to design a learnable attention function, sketched below, to model the query-key interactions in Vision Transformers.
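As a sanity check, a module like the one sketched above drops into a ViT encoder block wherever standard multi-head self-attention sits; the shapes and dimensions below are illustrative.

```python
import torch

# Hypothetical smoke test: swap standard multi-head self-attention in a
# ViT block for the learnable Fourier attention sketched earlier.
tokens = torch.randn(2, 197, 384)   # (batch, 196 patches + CLS token, dim)
attn = FourierKArAtAttention(dim=384, num_heads=6, grid_size=3)
out = attn(tokens)
print(out.shape)                    # torch.Size([2, 197, 384])
```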
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Local Control Networks (LCNs): Optimizing Flexibility in Neural Network Data Pattern Capture (2025)
- ViKANformer: Embedding Kolmogorov Arnold Networks in Vision Transformers for Pattern-Based Learning (2025)
- seqKAN: Sequence processing with Kolmogorov-Arnold Networks (2025)
- Unpaired Image Dehazing via Kolmogorov-Arnold Transformation of Latent Features (2025)
- Attention Learning is Needed to Efficiently Learn Parity Function (2025)
- Kolmogorov-Arnold Fourier Networks (2025)
- AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning (2025)