Spaces:

open-concept-steering
/

README

Running

App Files Files Community

README / README.md

hbfreed

Update README.md

2ace760 verified 14 days ago

preview code

raw

history blame contribute delete

1.93 kB

	---
	license: mit
	title: Open Concept Steering
	sdk: static
	emoji: 👁
	colorFrom: indigo
	colorTo: indigo
	short_description: Training SAEs
	---
	# Open Concept Steering

	Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) and [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude), this project aims to make concept steering accessible to the broader research community.

	## Features

	- Universal Model Support: Train SAEs on any HuggingFace transformer model
	- Feature Discovery: Find interpretable features representing specific concepts
	- Concept Steering: Amplify or suppress discovered features to influence model behavior
	- Interactive Chat: Chat with models while manipulating their internal features

	## Pre-trained Models

	We provide pre-trained SAEs and discovered features for popular models on HuggingFace:

	Each model repository includes:
	- Trained SAE weights
	- Catalog of discovered interpretable features
	- Example steering configurations
	- Performance benchmarks


	## Quick Start

	## Examples

	See the `examples/` directory for detailed notebooks demonstrating:
	- Training SAEs on different models
	- Finding and analyzing features
	- Steering model behavior
	- Interactive chat sessions

	## License

	This project is licensed under the MIT License.

	## Citation

	If you feel compelled to cite this library in your work, feel free to do so however you please.

	## Acknowledgments

	This project builds upon the work described in ["Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"](https://transformer-circuits.pub/2024/scaling-monosemanticity/) by Anthropic, and this project absolutely would not have been possible without it.