Phi4 Abliterated
This is an abliterated version of Phi4, produced with a new methodology (why has nobody tried this before?) aimed at improving its usability and neutrality.
Goal
The objective is to create a model that is neutral:
- Not uncensored, but one that avoids refusing neutral prompts the base model would ordinarily reject.
- One that enables fine-tuning for reduced censorship from a neutral baseline.
Original Methodology
In the original implementation:
- Hidden states for harmful and harmless prompts were compared at one specific layer of the model.
- The computed refusal direction was then applied to all layers.
Problem:
The resulting model:
- Became less usable and noticeably "dumber."
- The likely cause: a single refusal direction was applied uniformly across all layers, disregarding each layer's distinct characteristics.
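To make the recipe concrete, here is a minimal toy sketch of the original approach: the refusal direction is taken as the (unit-normalized) difference of mean hidden states between harmful and harmless prompts at one chosen layer, and that same direction is then projected out everywhere. The tensor shapes and random data are illustrative only, not actual Phi4 activations.

```python
import torch

torch.manual_seed(0)

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between the two prompt sets, unit-normalized."""
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    return d / d.norm()

def project_out(h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of h along the unit direction d."""
    return h - (h @ d).unsqueeze(-1) * d

# Simulated hidden states at ONE chosen layer: (n_prompts, hidden_dim).
hidden_dim = 64
harmless = torch.randn(32, hidden_dim)
# Harmful prompts get a shared offset, standing in for a "refusal feature".
harmful = torch.randn(32, hidden_dim) + torch.randn(hidden_dim)

d = refusal_direction(harmful, harmless)

# Original method: the SAME direction d is projected out at EVERY layer,
# even though it was computed from only one of them.
all_layer_states = [torch.randn(32, hidden_dim) for _ in range(4)]
ablated = [project_out(h, d) for h in all_layer_states]
```

After ablation, no layer's hidden states retain any component along `d`; the concern described above is that this one direction is a poor fit for layers it was never measured on.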
New Approach
In my fork, available here:
👉 https://github.com/Undi95/abliteration/
(based on the original https://github.com/Orion-zhen/abliteration.git)
I introduced a new approach:
- Each layer computes its own refusal direction.
- The refusal direction is layer-specific, reflecting the assumption that each layer has different characteristics and requirements.
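The per-layer variant can be sketched the same way: compute one difference-of-means direction per layer from that layer's own hidden states, and ablate each layer only with its own direction. Again, shapes and random data are purely illustrative and do not reflect the fork's actual implementation details.

```python
import torch

torch.manual_seed(0)

hidden_dim, n_layers, n_prompts = 64, 4, 32

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between the two prompt sets, unit-normalized."""
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    return d / d.norm()

def project_out(h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of h along the unit direction d."""
    return h - (h @ d).unsqueeze(-1) * d

# Simulated hidden states per layer, each with its OWN "refusal" offset.
offsets = [torch.randn(hidden_dim) for _ in range(n_layers)]
harmless = [torch.randn(n_prompts, hidden_dim) for _ in range(n_layers)]
harmful = [torch.randn(n_prompts, hidden_dim) + o for o in offsets]

# One direction PER layer, each applied only to its own layer.
directions = [refusal_direction(hf, hl) for hf, hl in zip(harmful, harmless)]
ablated = [project_out(h, d) for h, d in zip(harmful, directions)]
```

Each layer ends up with its refusal component removed while the directions themselves remain distinct from layer to layer, which is the property the hypothesis below relies on.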
Hypothesis
This method avoids over-generalizing the refusal direction and allows each layer to retain its unique properties. The result:
- A more usable and intelligent model.
- A neutral starting point for further fine-tuning to reduce censorship without compromising performance.
Next Steps
After applying this method, the model can be fine-tuned to:
- Reduce over-censoring behavior.
- Maintain neutrality while improving overall utility.