# Phi4 Abliterated

This is **Phi4 abliterated** using a new methodology (why has nobody tried this before?) aimed at improving its usability and neutrality.

## Goal

The objective is to create a model that is **neutral**:

- **Not uncensored**, but one that no longer refuses the neutral prompts it would ordinarily reject.
- A neutral baseline from which further fine-tuning can reduce censorship.

## Original Methodology

In the original implementation:

1. Harmful and harmless prompts were compared on **one specific layer** of the model.
2. The refusal direction computed there was then applied to **all layers**.

### Problem

The resulting model:

- Became **less usable** and somewhat "dumb."
- Likely because a single refusal direction was applied uniformly across all layers, ignoring what each layer actually represents.

## New Approach

In my fork, available here:
👉 [https://github.com/Undi95/abliteration/](https://github.com/Undi95/abliteration/)
(based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

I introduced a new approach (sketched in code at the end of this card):

- Each layer gets its **own refusal direction**.
- The refusal direction is **layer-specific**, reflecting the assumption that each layer has different characteristics and requirements.

## Hypothesis

Computing the direction per layer avoids over-generalizing the refusal direction and lets each layer keep its unique properties. The expected result:

- A more **usable** and **intelligent** model.
- A neutral starting point for further fine-tuning that reduces censorship without compromising performance.

## Next Steps

After applying this method, the model can be fine-tuned to:

- Reduce over-censoring behavior.
- Maintain neutrality while improving overall utility.
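
## Sketch of the Per-Layer Method

To make the per-layer idea concrete, here is a minimal sketch of the difference between the two methods. It assumes you have already collected per-layer mean hidden states for harmful and harmless prompts (e.g., via `output_hidden_states=True` in transformers); the function and variable names below are illustrative, not the actual API of either repository.

```python
import torch

def refusal_direction(harmful_mean: torch.Tensor,
                      harmless_mean: torch.Tensor) -> torch.Tensor:
    """Unit vector pointing from harmless toward harmful activations."""
    direction = harmful_mean - harmless_mean
    return direction / direction.norm()

def ablate(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a layer's output weight.

    Left-multiplying by (I - d d^T) means the layer can no longer
    write anything along `direction` into the residual stream.
    """
    d = direction.to(weight.dtype)
    return weight - torch.outer(d, d) @ weight

def abliterate_per_layer(layer_weights, harmful_means, harmless_means):
    """Per-layer variant: each layer is edited with a direction computed
    from its own activations, instead of reusing one direction taken
    from a single layer for the whole model."""
    return [
        ablate(w, refusal_direction(h, b))
        for w, h, b in zip(layer_weights, harmful_means, harmless_means)
    ]

# Toy usage on random data (hidden size 8, 2 layers):
weights = [torch.randn(8, 8) for _ in range(2)]
harmful = [torch.randn(8) for _ in range(2)]
harmless = [torch.randn(8) for _ in range(2)]
edited = abliterate_per_layer(weights, harmful, harmless)
```

The projection `W - d dᵀ W` zeroes the component of a layer's output along `d`. The original method reuses one `d` everywhere, which is exactly the over-generalization described above; the per-layer variant derives each `d` from that layer's own activations.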