hexgrad posted an update 3 days ago
Technical question: Is Abliteration still an effective method for uncensoring LLMs? Generally, what are the most effective methods to uncensor LLMs?

An effective uncensoring method would ideally be low-cost, data-efficient, and above all, successfully uncensor an LLM with minimal benchmark regressions.

"Tiananmen Square", "Winnie-the-Pooh", etc and more broadly "China influence/censorship" are some common criticisms leveled at DeepSeek.

I am vaguely aware of "Abliteration", a technique coined by @failspy (apologies if that attribution is incorrect) and originally described in a mid-2024 paper titled "Refusal in Language Models Is Mediated by a Single Direction" https://arxiv.org/abs/2406.11717
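For context, my rough understanding of the paper's core claim is that refusal behavior is mediated by a single direction in the residual stream, and that abliteration removes it by projecting that direction out of the weights. A minimal toy sketch of the idea (my own code, not the paper's or @failspy's implementation; the activation tensors and shapes are stand-ins):

```python
import torch

# Toy stand-ins: residual-stream activations collected at one layer/position.
# In practice these come from running the model on refusal-inducing vs. benign prompts.
d_model = 64
harmful_acts  = torch.randn(128, d_model)   # activations on harmful prompts
harmless_acts = torch.randn(128, d_model)   # activations on harmless prompts

# "Refusal direction" = normalized difference of the two activation means.
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the weight's output that writes along `direction`.

    Assumes `weight` writes into the residual stream, shape (d_model, d_in).
    """
    return weight - torch.outer(direction, direction @ weight)

# Ablate the direction from a toy output-projection matrix.
W_out = torch.randn(d_model, d_model)
W_out_abliterated = orthogonalize(W_out, refusal_dir)

# Sanity check: the edited matrix can no longer write along refusal_dir.
print((refusal_dir @ W_out_abliterated).norm())  # ~0
```

The appeal is obvious: no gradient updates, just a handful of contrastive prompts and a linear-algebra edit, which is why it gets described as cheap and data-efficient.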

Abliteration is proposed as a relatively cheap and effective way to bypass censorship in models. However, it is not without criticism: https://www.reddit.com/r/LocalLLaMA/comments/1f07b4b/abliteration_fails_to_uncensor_models_while_it/

Curious to hear people's takes on Abliteration or other uncensoring methods, especially as it relates to DeepSeek.

By design, it probably will not have what you are looking for in its training data, unless it is an answer it can reason or calculate, or something widely talked about like Tiananmen Square that is already in the layers. DeepSeek was probably trained unsupervised, without sanitization, from Llama model layers. For historical or cultural accuracy, Google is the model to focus on (it doesn't censor most historical facts and is largely free in their AI Studio).
If you are looking for models for information extraction, ironically one of the best IE models is a Chinese model from THU-KEG; we made a quant or two of it: https://huggingface.co/IntelligentEstate/Keg_Party-DPO-1.5B-Q8_0-GGUF


I do not think the usual concern, that an abliterated model will hallucinate, applies to DeepSeek. It was trained on 14.8T tokens, right? Unless they have unheard-of levels of data cleaning, it seems totally infeasible to sweep all mentions of Tiananmen Square, Winnie-the-Pooh, Taiwan, and so on from the dataset.

I suspect that the refusal is baked into the weights, but the knowledge has also got to be in there somewhere. It is a matter of science to tinker with the weights to remove the refusal and unlock that knowledge. Perplexity may have done something like this already, but I am not sure whether they used an enormous system prompt, whether they're RAG-ing it in, or both, or something else.
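On "tinkering with the weights": the lighter-weight variant, as I understand it, is to ablate the direction at inference time with forward hooks rather than permanently editing checkpoints. A rough, hypothetical sketch (the toy module and precomputed `refusal_dir` are assumptions, nothing DeepSeek- or Perplexity-specific):

```python
import torch
import torch.nn as nn

d_model = 64
refusal_dir = torch.randn(d_model)
refusal_dir = refusal_dir / refusal_dir.norm()  # would be precomputed from contrastive prompts

def ablate_hook(module, inputs, output):
    # Project the refusal direction out of this block's output before it
    # flows into the residual stream.
    return output - (output @ refusal_dir).unsqueeze(-1) * refusal_dir

# Toy linear layer standing in for a real decoder block.
block = nn.Linear(d_model, d_model)
handle = block.register_forward_hook(ablate_hook)

x = torch.randn(8, d_model)
y = block(x)
print((y @ refusal_dir).abs().max())  # ~0: outputs no longer point along the direction

handle.remove()  # restore the original behavior
```

Whether that actually "unlocks" suppressed knowledge, rather than just suppressing the refusal phrasing, is exactly the question the Reddit criticism above raises.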