Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming • Paper • 2501.18837 • Published 14 days ago
Sparse Autoencoders Find Highly Interpretable Features in Language Models • Paper • 2309.08600 • Published Sep 15, 2023