Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
Abstract
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
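The matching step described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' released code: it assumes each SAE exposes its decoder as a matrix of shape (n_features, d_model) in a shared basis (e.g., the residual stream), pairs every feature in one layer with its nearest cosine neighbor in the next layer, and treats low-similarity features as newly emerging. The attribute names (e.g., `W_dec`) and the threshold value are assumptions for the example.

```python
# Minimal sketch of data-free cosine matching between SAE decoder directions
# in two consecutive layers (illustrative, not the paper's implementation).
import numpy as np

def match_features(dec_a: np.ndarray, dec_b: np.ndarray, threshold: float = 0.5):
    """For each feature in layer A, find its best cosine match in layer B."""
    # L2-normalize decoder rows so the dot product equals cosine similarity.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T                        # (n_feat_a, n_feat_b) cosine matrix
    best = sims.argmax(axis=1)            # index of the closest layer-B feature
    best_sim = sims[np.arange(len(best)), best]
    # Features whose best match falls below the threshold are treated as
    # "newly appearing" in layer B rather than as continuations from layer A.
    matched = best_sim >= threshold
    return best, best_sim, matched

# Example: edges of a flow graph between two residual-stream SAEs
# (W_dec is a hypothetical attribute name for the decoder matrix).
# best, sim, keep = match_features(sae_7.W_dec, sae_8.W_dec)
```

Repeating this pairwise matching across all layers yields the edges of a feature flow graph, with unmatched features marking points where a semantic direction first appears.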
Community
We introduce a data-free cosine-based approach to align Sparse Autoencoder (SAE) features across residual, MLP, and attention modules in every layer, forming “flow graphs” that map how semantic directions originate, propagate, or vanish through the model. These graphs not only unveil the multi-layer circuits of feature transformations but also enable more fine-grained and effective steering of model outputs.
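For the steering side, a flow-graph feature can be amplified or suppressed by adding a scaled copy of its SAE decoder direction to the hidden states at a chosen layer. The sketch below assumes a PyTorch model and uses a forward hook; the layer path, `W_dec` attribute, and scaling value are placeholders, and the tuple handling depends on the specific architecture.

```python
# Illustrative steering sketch (assumed interface, not the paper's code):
# add alpha * (unit feature direction) to a layer's hidden states.
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Build a forward hook that shifts hidden states along `direction`."""
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        # Many decoder blocks return a tuple whose first element is the
        # hidden states; adjust for the model at hand.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * unit.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (placeholder names): amplify feature k at layer 12 with a positive
# alpha, or suppress it with a negative alpha, then remove the hook.
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(sae_12.W_dec[k], alpha=8.0))
# ...generate text...
# handle.remove()
```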
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models (2025)
- Steering Large Language Models with Feature Guided Activation Additions (2025)
- Tracking the Feature Dynamics in LLM Training: A Mechanistic Study (2024)
- Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment (2025)
- Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words (2025)
- Modular Training of Neural Networks aids Interpretability (2025)
- Does Representation Matter? Exploring Intermediate Layers in Large Language Models (2024)