Model Tampering Evals
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che*, Stephen Casper*, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
Paper: COMING SOON
BibTeX:
COMING SOON
Paper Abstract
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by searching for inputs that elicit harmful behaviors from the system. However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to the latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations for vulnerabilities than input-space attacks alone.
Info
This space contains 64 models. All are versions of meta-llama/Meta-Llama-3-8B-Instruct that have been fine-tuned with various machine unlearning methods to unlearn dual-use biology knowledge, using the WMDP-Bio benchmark. The goal of unlearning WMDP-Bio knowledge from these models is to (1) make them incapable of correctly answering questions related to bioweapons creation and (2) preserve their capabilities on all other tasks. See the paper for details. We used 8 unlearning methods:
- Gradient Difference (GradDiff), (Liu et al., 2022)
- Representation Misdirection for Unlearning (RMU), (Li et al., 2024)
- RMU with Latent Adversarial Training (RMU+LAT), (Sheshadri et al., 2024)
- Representation Noising (RepNoise), (Rosati et al., 2024)
- Erasure of Language Memory (ELM), (Gandikota et al., 2024)
- Representation Rerouting (RR), (Zou et al., 2024)
- Tamper Attack Resistance (TAR), (Tamirisa et al., 2024)
- PullBack & proJect (PB&J), (Anonymous, 2025)
We saved 8 evenly-spaced checkpoints from each of these 8 methods, for a total of 64 models. A minimal sketch of loading one of these checkpoints is shown below.
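Each checkpoint can be loaded like any ordinary Hugging Face causal LM. The sketch below assumes the `transformers` and `torch` libraries and uses one of the GradDiff checkpoints in this space (`LLM-GAT/llama-3-8b-instruct-graddiff-checkpoint-1`) as an illustrative model ID; any of the other checkpoints can be substituted.

```python
# Minimal sketch: load one unlearned checkpoint from this space and generate a reply.
# Assumes `transformers` and `torch` are installed and the checkpoint repo is accessible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM-GAT/llama-3-8b-instruct-graddiff-checkpoint-1"  # any of the 64 checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Llama-3-8B fits on a single ~24 GB GPU in bf16
    device_map="auto",
)

# Llama-3-Instruct models expect the chat template.
messages = [{"role": "user", "content": "Briefly explain what machine unlearning is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```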
Evaluation
Good unlearning needs to balance removal of harmful capabilities with preservation of general capabilities, so we evaluated the models on multiple benchmarks; a sketch of running these evaluations is shown after the list.
- WMDP-Bio (Bio capabilities)
- MMLU (General capabilities)
- AGIEval (General capabilities)
- MT-Bench (General capabilities)
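The multiple-choice benchmarks above (WMDP-Bio, MMLU, AGIEval) can be scored with standard tooling, while MT-Bench needs its own LLM-as-judge pipeline. As one illustration (not necessarily the exact setup used in the paper), EleutherAI's `lm-evaluation-harness` exposes `wmdp_bio` and `mmlu` tasks that can be run like this:

```python
# Illustrative sketch: score one checkpoint on WMDP-Bio and MMLU with
# EleutherAI's lm-evaluation-harness (`pip install lm-eval`). The task names and
# simple_evaluate API belong to the harness; the paper's exact settings may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=LLM-GAT/llama-3-8b-instruct-graddiff-checkpoint-1,dtype=bfloat16",
    tasks=["wmdp_bio", "mmlu"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    # Multiple-choice accuracy; 0.25 is the random-guess baseline for these tasks.
    print(task, metrics.get("acc,none"))
```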
We then calculated an unlearning score: a normalized measure of how much WMDP-Bio performance drops relative to how much general capability is preserved.
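The precise definition of the unlearning score is given in the paper. Purely as an illustration of the kind of normalization involved, and not the paper's actual formula, a score of this flavor could combine the fraction of the achievable WMDP-Bio drop (toward the 0.25 random-guess floor) with the fraction of general capability retained:

```python
# Hypothetical illustration only -- see the paper for the actual unlearning score.
def unlearning_score_sketch(
    wmdp_unlearned: float,     # WMDP-Bio accuracy of the unlearned checkpoint
    wmdp_base: float,          # WMDP-Bio accuracy of the base model (0.70 for Llama-3-8B-Instruct)
    general_unlearned: float,  # aggregate general-capability score of the checkpoint
    general_base: float,       # same aggregate for the base model
    floor: float = 0.25,       # random-guess baseline for the multiple-choice benchmarks
) -> float:
    # Fraction of the achievable WMDP-Bio drop (base accuracy -> random guessing) realized.
    forget = (wmdp_base - wmdp_unlearned) / (wmdp_base - floor)
    # Fraction of general capability retained relative to the base model.
    retain = general_unlearned / general_base
    # Clamp each factor to [0, 1]; 1.0 would mean perfect forgetting with no utility loss.
    return max(0.0, min(1.0, forget)) * max(0.0, min(1.0, retain))
```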
See complete details in the paper where we also present results from evaluating these methods under 11 attacks.
We report results for the checkpoint from each method with the highest unlearning score: the original WMDP-Bio performance, the worst-case WMDP-Bio performance after attack, and three measures of general utility (MMLU, MT-Bench, and AGIEval). For all benchmarks, the random-guess baseline is 0.25, except for MT-Bench/10, where it is 0.1. Representation Rerouting (RR) has the best unlearning score. No model has a WMDP-Bio performance below 0.36 after the most effective attack. We note that the GradDiff and TAR models performed very poorly, often struggling with basic fluency.
| Method | WMDP ↓ | WMDP, Best Input Attack ↓ | WMDP, Best Tamp. Attack ↓ | MMLU ↑ | MT-Bench/10 ↑ | AGIEval ↑ | Unlearning Score ↑ |
|---|---|---|---|---|---|---|---|
| Llama3 8B Instruct | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
| GradDiff | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
| RMU | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
| RMU + LAT | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
| RepNoise | 0.29 | 0.30 | 0.65 | 0.59 | 0.71 | 0.37 | 0.78 |
| ELM | 0.24 | 0.38 | 0.71 | 0.59 | 0.76 | 0.37 | 0.95 |
| RR | 0.26 | 0.28 | 0.66 | 0.61 | 0.76 | 0.44 | 0.96 |
| TAR | 0.28 | 0.29 | 0.36 | 0.54 | 0.12 | 0.31 | 0.09 |
| PB&J | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
Full Eval Results for All 64 Models
View and download here.