Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Paper • 2407.15549 • Published Jul 22, 2024
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability Paper • 2405.10927 • Published May 17, 2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training Paper • 2403.05030 • Published Mar 8, 2024
LLM-LAT/llama2-7b-chat-lat-unlearn-harry-potter-stronger-unlearning Text Generation • Updated Jul 22, 2024