Model Tampering Evals
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che*, Stephen Casper*, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
Paper: COMING SOON
BibTeX:
COMING SOON
Paper Abstract
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by searching for inputs that elicit harmful behaviors from the system. However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to the latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations for vulnerabilities than input-space attacks alone.
Info
This space contains 64 models. All are versions of meta-llama/Meta-Llama-3-8B-Instruct that have been fine-tuned with various machine unlearning methods to unlearn dual-use biology knowledge, using the WMDP-Bio benchmark. The goal of unlearning WMDP-Bio knowledge from these models is to (1) make them incapable of correctly answering questions related to bioweapons creation and (2) preserve their capabilities on all other tasks. See the paper for details. We used 8 unlearning methods:
- Gradient Difference (GradDiff), (Liu et al., 2022)
- Representation Misdirection for Unlearning (RMU), (Li et al., 2024)
- RMU with Latent Adversarial Training (RMU+LAT), (Sheshadri et al., 2024)
- Representation Noising (RepNoise), (Rosati et al., 2024)
- Erasure of Language Memory (ELM), (Gandikota et al., 2024)
- Representation Rerouting (RR), (Zou et al., 2024)
- Tamper Attack Resistance (TAR), (Tamirisa et al., 2024)
- PullBack & proJect (PB&J), (Anonymous, 2025)
We saved 8 evenly-spaced checkpoints from each of these 8 methods, for a total of 64 models. A minimal sketch of loading one of these checkpoints is shown below.
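Each checkpoint can be loaded like any ordinary Hugging Face causal LM. The sketch below assumes the `transformers` and `torch` libraries and uses one of the GradDiff checkpoints in this space (`LLM-GAT/llama-3-8b-instruct-graddiff-checkpoint-1`) as an illustrative model ID; any of the other checkpoints can be substituted.

```python
# Minimal sketch: load one unlearned checkpoint from this space and generate a reply.
# Assumes `transformers` and `torch` are installed and the checkpoint repo is accessible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM-GAT/llama-3-8b-instruct-graddiff-checkpoint-1"  # any of the 64 checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Llama-3-8B fits on a single ~24 GB GPU in bf16
    device_map="auto",
)

# Llama-3-Instruct models expect the chat template.
messages = [{"role": "user", "content": "Briefly explain what machine unlearning is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```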
Evaluation
Good unlearning needs to balance removal of harmful capabilities with preservation of general capabilities, so we evaluated the models on multiple benchmarks; a sketch of running these evaluations is shown after the list.
- WMDP-Bio (Bio capabilities)
- MMLU (General capabilities)
- AGIEval (General capabilities)
- MT-Bench (General capabilities)
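The multiple-choice benchmarks above (WMDP-Bio, MMLU, AGIEval) can be scored with standard tooling, while MT-Bench needs its own LLM-as-judge pipeline. As one illustration (not necessarily the exact setup used in the paper), EleutherAI's `lm-evaluation-harness` exposes `wmdp_bio` and `mmlu` tasks that can be run like this:

```python
# Illustrative sketch: score one checkpoint on WMDP-Bio and MMLU with
# EleutherAI's lm-evaluation-harness (`pip install lm-eval`). The task names and
# simple_evaluate API belong to the harness; the paper's exact settings may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=LLM-GAT/llama-3-8b-instruct-graddiff-checkpoint-1,dtype=bfloat16",
    tasks=["wmdp_bio", "mmlu"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    # Multiple-choice accuracy; 0.25 is the random-guess baseline for these tasks.
    print(task, metrics.get("acc,none"))
```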
We then calculated an unlearning score: a normalized measure of how much WMDP-Bio performance drops relative to how much general capability is preserved.
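The precise definition of the unlearning score is given in the paper. Purely as an illustration of the kind of normalization involved, and not the paper's actual formula, a score of this flavor could combine the fraction of the achievable WMDP-Bio drop (toward the 0.25 random-guess floor) with the fraction of general capability retained:

```python
# Hypothetical illustration only -- see the paper for the actual unlearning score.
def unlearning_score_sketch(
    wmdp_unlearned: float,     # WMDP-Bio accuracy of the unlearned checkpoint
    wmdp_base: float,          # WMDP-Bio accuracy of the base model (0.70 for Llama-3-8B-Instruct)
    general_unlearned: float,  # aggregate general-capability score of the checkpoint
    general_base: float,       # same aggregate for the base model
    floor: float = 0.25,       # random-guess baseline for the multiple-choice benchmarks
) -> float:
    # Fraction of the achievable WMDP-Bio drop (base accuracy -> random guessing) realized.
    forget = (wmdp_base - wmdp_unlearned) / (wmdp_base - floor)
    # Fraction of general capability retained relative to the base model.
    retain = general_unlearned / general_base
    # Clamp each factor to [0, 1]; 1.0 would mean perfect forgetting with no utility loss.
    return max(0.0, min(1.0, forget)) * max(0.0, min(1.0, retain))
```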
See complete details in the paper where we also present results from evaluating these methods under 11 attacks.
We report results for the checkpoint from each method with the highest unlearning score: the original WMDP-Bio performance, the worst-case WMDP-Bio performance after attack, and three measures of general utility (MMLU, MT-Bench, and AGIEval). For all benchmarks, the random-guess baseline is 0.25, except for MT-Bench/10, where it is 0.1. Representation Rerouting (RR) has the best unlearning score. No model has a WMDP-Bio performance below 0.36 after the most effective attack. We note that the GradDiff and TAR models performed very poorly, often struggling with basic fluency.
| Method | WMDP ↓ | WMDP, Best Input Attack ↓ | WMDP, Best Tamp. Attack ↓ | MMLU ↑ | MT-Bench/10 ↑ | AGIEval ↑ | Unlearning Score ↑ |
|---|---|---|---|---|---|---|---|
| Llama3 8B Instruct | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
| GradDiff | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
| RMU | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
| RMU + LAT | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
| RepNoise | 0.29 | 0.30 | 0.65 | 0.59 | 0.71 | 0.37 | 0.78 |
| ELM | 0.24 | 0.38 | 0.71 | 0.59 | 0.76 | 0.37 | 0.95 |
| RR | 0.26 | 0.28 | 0.66 | 0.61 | 0.76 | 0.44 | 0.96 |
| TAR | 0.28 | 0.29 | 0.36 | 0.54 | 0.12 | 0.31 | 0.09 |
| PB&J | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
Full Eval Results for All 64 Models
View and download here.