Aligning Teacher with Student Preferences for Tailored Training Data Generation Paper • 2406.19227 • Published Jun 27 • 24
Self-Play Preference Optimization for Language Model Alignment Paper • 2405.00675 • Published May 1 • 25
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues Paper • 2404.03820 • Published Apr 4 • 24
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning Paper • 2407.00617 • Published Jun 30 • 7
UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI Paper • 2407.00106 • Published Jun 27 • 5
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges Paper • 2406.12624 • Published Jun 18 • 36
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs Paper • 2406.18495 • Published Jun 26 • 12
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models Paper • 2406.18510 • Published Jun 26 • 8
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces Paper • 2406.11614 • Published Jun 17 • 4
Large Language Model Unlearning via Embedding-Corrupted Prompts Paper • 2406.07933 • Published Jun 12 • 7
Deep Bayesian Active Learning for Preference Modeling in Large Language Models Paper • 2406.10023 • Published Jun 14 • 2
Transforming and Combining Rewards for Aligning Large Language Models Paper • 2402.00742 • Published Feb 1 • 11
LongAlign: A Recipe for Long Context Alignment of Large Language Models Paper • 2401.18058 • Published Jan 31 • 20
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs Paper • 2407.10058 • Published Jul 14 • 29
To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models Paper • 2407.01920 • Published Jul 2 • 13
The Art of Saying No: Contextual Noncompliance in Language Models Paper • 2407.12043 • Published Jul 16 • 4
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation Paper • 2410.09584 • Published Oct 12 • 47