Phi4 Abliterated
This is an abliterated version of Phi4, produced with a new methodology (why has nobody tried this before?) aimed at improving its usability and neutrality.
Goal
The objective is to create a model that is neutral:
- Not uncensored, but one that avoids refusing neutral prompts the base model would ordinarily reject.
- One that enables fine-tuning for reduced censorship from a neutral baseline.
Original Methodology
In the original implementation:
- Hidden states for harmful and harmless prompts were compared at one specific layer of the model.
- The computed refusal direction was then applied to all layers.
Problem:
The resulting model:
- Became less usable and noticeably "dumber."
- The likely cause: a single refusal direction was applied uniformly across all layers, disregarding each layer's distinct characteristics.
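To make the recipe concrete, here is a minimal toy sketch of the original approach: the refusal direction is taken as the (unit-normalized) difference of mean hidden states between harmful and harmless prompts at one chosen layer, and that same direction is then projected out everywhere. The tensor shapes and random data are illustrative only, not actual Phi4 activations.

```python
import torch

torch.manual_seed(0)

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between the two prompt sets, unit-normalized."""
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    return d / d.norm()

def project_out(h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of h along the unit direction d."""
    return h - (h @ d).unsqueeze(-1) * d

# Simulated hidden states at ONE chosen layer: (n_prompts, hidden_dim).
hidden_dim = 64
harmless = torch.randn(32, hidden_dim)
# Harmful prompts get a shared offset, standing in for a "refusal feature".
harmful = torch.randn(32, hidden_dim) + torch.randn(hidden_dim)

d = refusal_direction(harmful, harmless)

# Original method: the SAME direction d is projected out at EVERY layer,
# even though it was computed from only one of them.
all_layer_states = [torch.randn(32, hidden_dim) for _ in range(4)]
ablated = [project_out(h, d) for h in all_layer_states]
```

After ablation, no layer's hidden states retain any component along `d`; the concern described above is that this one direction is a poor fit for layers it was never measured on.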
New Approach
In my fork, available here:
👉 https://github.com/Undi95/abliteration/
(based on the original https://github.com/Orion-zhen/abliteration.git)
I introduced a new approach:
- Each layer computes its own refusal direction.
- The refusal direction is layer-specific, reflecting the assumption that each layer has different characteristics and requirements.
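The per-layer variant can be sketched the same way: compute one difference-of-means direction per layer from that layer's own hidden states, and ablate each layer only with its own direction. Again, shapes and random data are purely illustrative and do not reflect the fork's actual implementation details.

```python
import torch

torch.manual_seed(0)

hidden_dim, n_layers, n_prompts = 64, 4, 32

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between the two prompt sets, unit-normalized."""
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    return d / d.norm()

def project_out(h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of h along the unit direction d."""
    return h - (h @ d).unsqueeze(-1) * d

# Simulated hidden states per layer, each with its OWN "refusal" offset.
offsets = [torch.randn(hidden_dim) for _ in range(n_layers)]
harmless = [torch.randn(n_prompts, hidden_dim) for _ in range(n_layers)]
harmful = [torch.randn(n_prompts, hidden_dim) + o for o in offsets]

# One direction PER layer, each applied only to its own layer.
directions = [refusal_direction(hf, hl) for hf, hl in zip(harmful, harmless)]
ablated = [project_out(h, d) for h, d in zip(harmful, directions)]
```

Each layer ends up with its refusal component removed while the directions themselves remain distinct from layer to layer, which is the property the hypothesis below relies on.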
Hypothesis
This method avoids over-generalizing the refusal direction and allows each layer to retain its unique properties. The result:
- A more usable and intelligent model.
- A neutral starting point for further fine-tuning to reduce censorship without compromising performance.
Next Steps
After applying this method, the model can be fine-tuned to:
- Reduce over-censoring behavior.
- Maintain neutrality while improving overall utility.