Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large language models that exploits long context windows to override safety constraints.
Key Points:
* Prompts LLMs with hundreds of examples of harmful behavior formatted as a faux dialogue (see the sketch below)
* Generates the malicious examples using an uninhibited "helpful-only" model
* Effective at jailbreaking models such as Claude 2.0, GPT-3.5, and GPT-4
* Standard alignment techniques provide limited protection against long-context attacks
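To make the prompt structure concrete, here is a minimal Python sketch of how a many-shot prompt could be assembled from faux dialogue turns; the `build_msj_prompt` helper and the placeholder example pairs are illustrative assumptions, not code or data from the paper.

```python
# Minimal sketch of the many-shot prompt structure described above.
# The build_msj_prompt helper and placeholder examples are illustrative
# assumptions, not Anthropic's actual attack code or data.

def build_msj_prompt(qa_pairs, target_question):
    """Concatenate many faux Human/Assistant turns, then append the
    real target question so the model continues the pattern."""
    turns = []
    for question, answer in qa_pairs:
        turns.append(f"Human: {question}")
        turns.append(f"Assistant: {answer}")
    # Leave the final assistant turn open for the model to complete.
    turns.append(f"Human: {target_question}")
    turns.append("Assistant:")
    return "\n\n".join(turns)

# Placeholder pairs standing in for examples generated by a
# "helpful-only" model; a real attack uses hundreds of such pairs.
examples = [
    ("[example question 1]", "[compliant answer 1]"),
    ("[example question 2]", "[compliant answer 2]"),
]

prompt = build_msj_prompt(examples * 128, "[target question]")
```

The key design point is scale: with only a handful of in-context examples the attack fails, but as the number of faux dialogue turns grows into the hundreds, the model increasingly follows the demonstrated pattern rather than its safety training.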