Ahmad Beirami's picture

Ahmad Beirami

beirami
Ā·

AI & ML interests

None yet

Recent Activity

Organizations

Google's profile picture

beirami's activity

reacted to gsarti's post with ā¤ļø about 1 year ago
view post
Post
šŸ” Today's pick in Interpretability & Analysis of LMs: Gradient-Based Language Model Red Teaming by N. Wichers, C. Denison and @beirami

This work proposes Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts inducing an LM to output unsafe responses.

In practice, prompts are learned by scoring LM responses with a safety-trained probing classifier, and back-propagating through frozen classifier and LM to update the prompt.

Authors experiment with variants of GBRT aimed at inducing realistic prompts in an efficient way, and GBRT prompts are more likely to generate unsafe responses than those found by established RL-based red teaming methods. Moreover, these attacks are shown to succeed even when the LM has been fine-tuned to produce safer outputs.

šŸ“„ Paper: In-Context Language Learning: Architectures and Algorithms (2401.12973)
šŸ’» Code: https://github.com/google-research/google-research/tree/master/gbrt