Teaching Language Models to Critique via Reinforcement Learning Paper • 2502.03492 • Published Feb 5 • 24 • 2
Jailbreaking as a Reward Misspecification Problem Paper • 2406.14393 • Published Jun 20, 2024 • 13 • 2