arxiv:2403.03329

Guardrail Baselines for Unlearning in LLMs

Published on Mar 5, 2024

Abstract

Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better distinguish the power of guardrails from that of finetuning, and it highlights scenarios where guardrails expose possible unintended behavior in existing metrics and benchmarks.
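
To make the guardrail baselines concrete, below is a minimal sketch (not the paper's code) of the two approaches named in the abstract: a prompting guardrail that instructs the model to behave as if it lacks the target knowledge, and an output filter that suppresses generations mentioning the concept. The target concept, banned-term list, refusal message, and the generate callable are all hypothetical placeholders.

```python
# Illustrative sketch of guardrail "unlearning" baselines (prompting + filtering),
# wrapped around an arbitrary black-box text generator. Not the paper's implementation.

from typing import Callable

UNLEARN_TARGET = "Harry Potter"  # hypothetical concept to be unlearned

# Guardrail 1: prompting -- prepend an instruction telling the model to
# act as if it has no knowledge of the target concept.
PROMPT_GUARDRAIL = (
    f"You do not know anything about {UNLEARN_TARGET}. "
    "If asked, say you have no information on this topic.\n\n"
)

# Guardrail 2: filtering -- post-hoc check of the generated output for
# strings associated with the target concept (hypothetical term list).
BANNED_TERMS = [UNLEARN_TARGET.lower(), "hogwarts"]
REFUSAL = "I'm sorry, I don't have information about that topic."


def guarded_generate(generate: Callable[[str], str], user_prompt: str) -> str:
    """Apply both guardrails around a black-box generate(prompt) -> text call."""
    output = generate(PROMPT_GUARDRAIL + user_prompt)
    if any(term in output.lower() for term in BANNED_TERMS):
        return REFUSAL  # filter: suppress output that leaks the target concept
    return output


if __name__ == "__main__":
    # Toy generator standing in for an LLM call.
    echo_model = lambda prompt: f"(model output for: {prompt!r})"
    print(guarded_generate(echo_model, "Tell me about Harry Potter."))
```

In the abstract's framing, the appeal of such baselines is that they require no gradient updates: the prompt and filter can be adjusted without generating finetuning examples or retraining the model.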

