LLM-LAT (LLM Latent Adversarial Training)

stecas

authored a paper about 1 year ago

Open Problems in Mechanistic Interpretability

Paper • 2501.16496 • Published Jan 27, 2025 • 21

aengusl

authored a paper over 1 year ago

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Paper • 2407.15549 • Published Jul 22, 2024

CindyXWu

authored 2 papers over 1 year ago

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Paper • 2405.10927 • Published May 17, 2024 • 3

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Paper • 2407.15549 • Published Jul 22, 2024

stecas

authored 2 papers over 1 year ago

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Paper • 2403.05030 • Published Mar 8, 2024

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Paper • 2407.15549 • Published Jul 22, 2024

stecas

updated a Space over 1 year ago

README

🌍

abhayesian

updated 2 models over 1 year ago

LLM-LAT/robust-llama3-8b-instruct

Text Generation • 8B • Updated Aug 1, 2024 • 446 • 12

LLM-LAT/llama3-8b-instruct-lat-jailbreak-robust3

Updated Aug 1, 2024

stecas

updated 2 datasets over 1 year ago

LLM-LAT/benign-dataset

Viewer • Updated Jul 24, 2024 • 165k • 186 • 4

LLM-LAT/harmful-dataset

Viewer • Updated Jul 24, 2024 • 4.95k • 4.5k • 34

abhayesian

updated 5 models over 1 year ago

CindyXWu

updated 3 models over 1 year ago

LLM-LAT/llama2-7b-chat-lat-unlearn-harry-potter-stronger-unlearning

Text Generation • 7B • Updated Jul 22, 2024 • 23 • 1

LLM-LAT/llama2-7b-chat-lat-unlearn-harry-potter-normal

Text Generation • 7B • Updated Jul 22, 2024 • 2

LLM-LAT/zephyr7b-beta-rmu-lat-unlearn-wmdp-bio-cyber

Text Generation • 7B • Updated Jul 22, 2024 • 3 • 1

Baidicoot

updated a model over 1 year ago

LLM-LAT/llama2-7b-chat-lat-removed-backdoor5

Text Generation • 7B • Updated Jul 5, 2024

Team members: 6
