Phi4 Abliterated

This is Phi4, abliterated using a new methodology (why has nobody tried this before?) aimed at improving its usability and neutrality.

Goal

The objective is to create a model that is neutral:

  • Not fully uncensored, but one that avoids refusing neutral prompts it would ordinarily reject.
  • Enables fine-tuning for reduced censorship, starting from a neutral baseline.

Original Methodology

In the original implementation:

  1. Harmful and harmless prompts were compared on one specific layer of the model.
  2. The computed refusal direction was then applied to all layers.
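The two steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: NumPy stands in for PyTorch, and `harmful_acts` / `harmless_acts` are hypothetical arrays of hidden-state activations collected at the chosen layer (one row per prompt):

```python
import numpy as np

def compute_refusal_direction(harmful_acts, harmless_acts):
    # Step 1: difference of mean activations between harmful and
    # harmless prompts at one chosen layer, normalized to a unit vector.
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate(hidden_state, direction):
    # Step 2: project the refusal direction out of a hidden state.
    # The original method applied this same direction at every layer.
    return hidden_state - np.dot(hidden_state, direction) * direction
```

After ablation, the hidden state has no remaining component along the refusal direction, which is what suppresses the refusal behavior.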

Problem:

The resulting model became less usable and somewhat "dumb," likely because a single refusal direction was applied uniformly across all layers, disregarding each layer's distinct characteristics.

New Approach

In my fork, available here:
👉 https://github.com/Undi95/abliteration/
(based on the original https://github.com/Orion-zhen/abliteration.git)

I introduced a new approach:

  • Each layer computes its own refusal direction.
  • The refusal direction is layer-specific, reflecting the assumption that each layer has different characteristics and requirements.
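Under the same illustrative assumptions as before (NumPy instead of PyTorch, hypothetical names), the per-layer variant computes one direction per layer rather than reusing a single one; `harmful_by_layer` and `harmless_by_layer` map a layer index to the activations collected at that layer:

```python
import numpy as np

def per_layer_refusal_directions(harmful_by_layer, harmless_by_layer):
    # One refusal direction per layer, instead of a single shared one.
    directions = {}
    for layer, harmful_acts in harmful_by_layer.items():
        harmless_acts = harmless_by_layer[layer]
        d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
        directions[layer] = d / np.linalg.norm(d)
    return directions
```

Each layer is then ablated only along its own direction, so a layer whose refusal signal points elsewhere is left largely untouched.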

Hypothesis

This method avoids over-generalizing the refusal direction and allows each layer to retain its unique properties. The result:

  • A more usable and intelligent model.
  • A neutral starting point for further fine-tuning to reduce censorship without compromising performance.

Next Steps

After applying this method, the model can be fine-tuned to:

  • Reduce over-censoring behavior.
  • Maintain neutrality while improving overall utility.