Technical question: Is Abliteration still an effective method for uncensoring LLMs? Generally, what are the most effective methods to uncensor LLMs?
An effective uncensoring method would ideally be low-cost and data-efficient and, above all, would actually remove refusals with minimal benchmark regressions.
"Tiananmen Square", "Winnie-the-Pooh", etc and more broadly "China influence/censorship" are some common criticisms leveled at DeepSeek.
I am vaguely aware of "Abliteration", a technique coined by @failspy (apologies if that attribution is incorrect) and based on the directional-ablation method described in the mid-2024 paper "Refusal in Language Models Is Mediated by a Single Direction": https://arxiv.org/abs/2406.11717
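For anyone unfamiliar, here is a minimal sketch of the directional-ablation idea from that paper, as I understand it: estimate a "refusal direction" as the difference in mean residual-stream activations between harmful and harmless prompts, then project that direction out of the activations (or bake the projection into the weights). Note the shapes and names below (`harmful_acts`, `d_model`, etc.) are my own illustrative assumptions, not code from the paper or from @failspy's implementation.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means "refusal direction", unit-normalized.

    Both inputs are assumed to be residual-stream activations captured at
    some layer, shape (n_prompts, d_model).
    """
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Inference-time ablation: x <- x - (x . r) r, removing the component
    of each activation along the (unit) refusal direction."""
    return acts - (acts @ direction).unsqueeze(-1) * direction

def orthogonalize(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Weight-orthogonalization variant: given a matrix W of shape
    (d_model, d_in) that writes into the residual stream (output = x @ W.T),
    return W' = W - r r^T W, so the layer can never write along r."""
    return W - torch.outer(direction, direction @ W)

if __name__ == "__main__":
    # Toy demonstration with random stand-in activations.
    d_model = 8
    harmful = torch.randn(32, d_model) + 1.0   # fake "harmful prompt" acts
    harmless = torch.randn(32, d_model)        # fake "harmless prompt" acts
    r = refusal_direction(harmful, harmless)

    x = torch.randn(4, d_model)
    assert torch.allclose(ablate(x, r) @ r, torch.zeros(4), atol=1e-5)
```

The weight-orthogonalization variant is what makes the edit permanent: once every matrix that writes into the residual stream (embeddings, attention output projections, MLP down-projections) has been orthogonalized against the refusal direction, no inference-time hooks are needed and the model can be saved and shared as a normal checkpoint.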
Ok, my 14B DeepSeek R1 merge with Qwen2.5 1M is really hot right now: it's got 2.6k downloads! It's sitting pretty as the top trending model on the third page. 🔥