Ahmad Beirami

beirami

http://www.mit.edu/~beirami/

beirami's activity

authored 7 papers 16 days ago

Towards Robust Prompts on Vision-Language Models

Paper • 2304.08479 • Published Apr 17, 2023

Enhancing Group Fairness in Online Settings Using Oblique Decision Forests

Paper • 2310.11401 • Published Oct 17, 2023

Situated and Interactive Multimodal Conversations

Paper • 2006.01460 • Published Jun 2, 2020

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models

Paper • 2307.12980 • Published Jul 24, 2023 • 1

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Paper • 2411.18688 • Published Nov 27, 2024

InfAlign: Inference-aware language model alignment

Paper • 2412.19792 • Published Dec 27, 2024 • 1

Data-augmented phrase-level alignment for mitigating object hallucination

Paper • 2405.18654 • Published May 28, 2024

authored a paper about 1 month ago

Theoretical guarantees on the best-of-n alignment policy

Paper • 2401.01879 • Published Jan 3, 2024

authored a paper 8 months ago

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Paper • 2406.05946 • Published Jun 10, 2024

authored a paper 11 months ago

Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

Paper • 2404.12318 • Published Apr 18, 2024 • 15

reacted to gsarti's post with ❤️ about 1 year ago

Post

🔍 Today's pick in Interpretability & Analysis of LMs: Gradient-Based Language Model Red Teaming by N. Wichers, C. Denison and @beirami

This work proposes Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts inducing an LM to output unsafe responses.

In practice, prompts are learned by scoring LM responses with a safety-trained probing classifier and back-propagating through the frozen classifier and the frozen LM to update the prompt.

The authors experiment with GBRT variants aimed at efficiently inducing realistic prompts, and find that GBRT prompts are more likely to elicit unsafe responses than those found by established RL-based red teaming methods. Moreover, these attacks are shown to succeed even when the LM has been fine-tuned to produce safer outputs.
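
The update loop described in this post can be illustrated on a toy problem. The sketch below is not the authors' implementation: small fixed matrices stand in for the frozen LM and the frozen safety classifier, the prompt is a single position relaxed with a plain softmax over a four-token vocabulary, and all names and numeric values (`EMBED`, `LM_W`, `CLF_W`, `CLF_B`, `optimize_prompt`) are hypothetical. It only shows the core idea that gradients flow through the frozen components and only the prompt parameters change.

```python
import math

# Frozen toy parameters (hypothetical values, not taken from the paper).
EMBED = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [-1.0, -1.0, 0.0]]    # one embedding row per vocabulary token
LM_W = [[0.5, -0.2, 0.1],
        [0.3, 0.8, -0.5],
        [-0.4, 0.1, 0.9]]      # stand-in for the frozen LM
CLF_W = [1.0, -1.5, 0.5]       # stand-in for the frozen safety classifier
CLF_B = -0.2
DIM = 3


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def forward(theta):
    """Return (token probabilities, 'response' features, unsafe probability)."""
    p = softmax(theta)
    # Soft prompt embedding: probability-weighted mix of token embeddings.
    e = [sum(p[v] * EMBED[v][j] for v in range(len(p))) for j in range(DIM)]
    # Frozen "LM": maps the prompt embedding to response features.
    r = [math.tanh(sum(LM_W[i][j] * e[j] for j in range(DIM))) for i in range(DIM)]
    # Frozen "classifier": scores the response as unsafe with a sigmoid.
    z = sum(CLF_W[i] * r[i] for i in range(DIM)) + CLF_B
    return p, r, 1.0 / (1.0 + math.exp(-z))


def grad_theta(theta):
    """Gradient of the unsafe probability with respect to the prompt logits."""
    p, r, unsafe = forward(theta)
    dz = unsafe * (1.0 - unsafe)                                 # through the sigmoid
    dh = [dz * CLF_W[i] * (1.0 - r[i] ** 2) for i in range(DIM)]  # through tanh
    de = [sum(dh[i] * LM_W[i][j] for i in range(DIM)) for j in range(DIM)]
    dp = [sum(de[j] * EMBED[v][j] for j in range(DIM)) for v in range(len(p))]
    mean = sum(dp[v] * p[v] for v in range(len(p)))
    return [p[k] * (dp[k] - mean) for k in range(len(p))]        # through softmax


def optimize_prompt(steps=200, lr=1.0):
    """Gradient ascent on the prompt logits only; LM and classifier stay fixed."""
    theta = [0.0] * len(EMBED)
    history = [forward(theta)[2]]
    for _ in range(steps):
        g = grad_theta(theta)
        theta = [t + lr * gi for t, gi in zip(theta, g)]
        history.append(forward(theta)[2])
    return theta, history
```

Calling `optimize_prompt()` returns the learned prompt logits and the unsafe-probability trace, which should rise as probability mass shifts toward the tokens the toy classifier scores as unsafe. A real setup would replace the toy matrices with an actual LM and classifier and rely on automatic differentiation rather than hand-written gradients.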

📄 Paper: Gradient-Based Language Model Red Teaming (2401.16656)
💻 Code: https://github.com/google-research/google-research/tree/master/gbrt

authored 2 papers about 1 year ago

Gradient-Based Language Model Red Teaming

Paper • 2401.16656 • Published Jan 30, 2024 • 1

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Paper • 2312.09244 • Published Dec 14, 2023 • 11

authored a paper over 1 year ago

Controlled Decoding from Language Models

Paper • 2310.17022 • Published Oct 25, 2023 • 15