Open Concept Steering

Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on Scaling Monosemanticity and Golden Gate Claude, this project aims to make concept steering accessible to the broader research community.
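To make the idea of a sparse autoencoder concrete, here is a minimal sketch of one forward pass in plain Python. It is not this library's implementation: the class name `TinySAE`, its methods, and the plain L1 sparsity penalty are assumptions chosen for illustration, and a real SAE would use a tensor library and be trained on residual-stream activations.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Toy sparse autoencoder: d_model -> n_features -> d_model (hypothetical)."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Encoder has one row per feature; decoder has one row per model dimension.
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Feature activations: ReLU(W_enc (x - b_dec) + b_enc)."""
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.w_enc, centered), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, features):
        """Reconstruction: W_dec f + b_dec."""
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Squared reconstruction error plus an L1 penalty on the features."""
        features = self.encode(x)
        x_hat = self.decode(features)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        # Activations are non-negative after ReLU, so their sum is the L1 norm.
        return recon + l1_coeff * sum(features)


sae = TinySAE(d_model=4, n_features=8)
activation = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(activation)
reconstruction = sae.decode(features)
```

The feature layer is wider than the input (8 features for 4 dimensions here), and the sparsity penalty pushes most activations to zero, which is what makes individual features candidates for interpretation.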

Features

Coming soon!

Universal Model Support: Train SAEs on any HuggingFace transformer model
Feature Discovery: Find interpretable features representing specific concepts
Concept Steering: Amplify or suppress discovered features to influence model behavior
Interactive Chat: Chat with models while manipulating their internal features
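As an illustration of what amplifying or suppressing a feature can mean, the sketch below adds a scaled feature direction to a single activation vector. This is one common form of steering and is an assumption here, not necessarily the method this library uses. The function names `steer` and `projection` are hypothetical, and the direction stands in for one feature's decoder vector.

```python
def _norm(vector):
    """Euclidean length of a vector."""
    return sum(v * v for v in vector) ** 0.5


def projection(activation, direction):
    """Component of the activation along the unit-normalized direction."""
    length = _norm(direction)
    if length == 0.0:
        raise ValueError("feature direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / length


def steer(activation, direction, strength):
    """Add `strength` units of the unit-normalized direction to the activation.

    Positive strength amplifies the feature; negative strength suppresses it.
    """
    length = _norm(direction)
    if length == 0.0:
        raise ValueError("feature direction must be non-zero")
    return [a + strength * d / length for a, d in zip(activation, direction)]


residual = [1.0, 0.0, 2.0]            # hypothetical residual-stream activation
feature_direction = [0.0, 3.0, 4.0]   # hypothetical decoder vector for one feature

amplified = steer(residual, feature_direction, strength=5.0)
suppressed = steer(residual, feature_direction, strength=-5.0)
```

Because the direction is normalized before scaling, `strength` is the exact amount by which the activation's projection onto the feature changes. In a real model the same addition would be applied to the chosen layer's activations at every token position during generation.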

Pre-trained Models

In the spirit of fully open-source models, we have started training SAEs on OLMo 2 7B.

We provide pre-trained SAEs and discovered features for popular models on HuggingFace.

Each model repository will include:

Trained SAE weights
Catalog of discovered interpretable features
Example steering configurations

Quick Start

Examples (In progress)

See the examples/ directory for detailed notebooks demonstrating:

Training SAEs on different models
Finding and analyzing features
Steering model behavior
Interactive chat sessions

License

This project is licensed under the MIT License.

Citation

If you feel compelled to cite this library in your work, feel free to do so however you please.

Acknowledgments

This project builds on Anthropic's work described in "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", "Update on how we train SAEs", and "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet". It absolutely would not have been possible without that work.

Datasets

open-concept-steering/OLMo-2_Residual_Streams
