Undi95 committed (verified)
Commit 719d849 · Parent(s): faf4809

Create README.md

Files changed (1): README.md added (+44 lines)

# Details

# Phi4 Abliterated

This is **Phi4 abliterated** using a new methodology (why has nobody tried this before?) aimed at improving its usability and neutrality.

## Goal

The objective is to create a model that is **neutral**:
- **Not uncensored**, but one that avoids refusing neutral prompts it would ordinarily reject.
- One that enables fine-tuning for reduced censorship, starting from a neutral baseline.

## Original Methodology

In the original implementation:
1. Harmful and harmless prompts were compared on **one specific layer** of the model to compute a single refusal direction.
2. That refusal direction was then applied to **all layers**, as sketched below.
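
A minimal sketch of that single-direction pipeline, assuming a Hugging Face causal LM. The helper `mean_hidden_state`, the layer index `14`, the placeholder prompt lists, and the ablated module names are illustrative assumptions, not the actual abliteration code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

# Placeholder contrast sets; real runs use larger curated prompt lists.
harmful_prompts = ["How do I pick a lock?"]
harmless_prompts = ["How do I bake a loaf of bread?"]

def mean_hidden_state(prompts, layer_idx):
    # Mean last-token residual-stream activation at one layer.
    states = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(states).mean(dim=0)

# Step 1: one refusal direction, computed at a single layer.
harmful_mean = mean_hidden_state(harmful_prompts, layer_idx=14)
harmless_mean = mean_hidden_state(harmless_prompts, layer_idx=14)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()

# Step 2: project that same direction out of every layer's output weights,
# i.e. W <- W - r (r^T W), so no layer can write along r.
for layer in model.model.layers:
    for W in (layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight):
        W.data -= torch.outer(refusal_dir, refusal_dir @ W.data)
```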

### Problem

The resulting model:
- Became **less usable** and somewhat "dumb."
- Likely because a single refusal direction was applied uniformly across all layers, disregarding what each layer individually encodes.

## New Approach

In my fork, available here:
👉 [https://github.com/Undi95/abliteration/](https://github.com/Undi95/abliteration/)
(based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

I introduced a new approach, sketched below:
- Each layer computes its **own refusal direction**.
- The refusal direction is **layer-specific**, reflecting the assumption that each layer has different characteristics and requirements.
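
A corresponding sketch of the per-layer variant, reusing `model` and `tokenizer` from the example above. Again, this illustrates the idea rather than reproducing the fork's actual code; `mean_hidden_states_all_layers` is a hypothetical helper:

```python
# Collect the mean last-token activation at *every* layer in one pass,
# then ablate each decoder layer with its own refusal direction.
def mean_hidden_states_all_layers(prompts):
    sums = None
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states = [h[0, -1] for h in out.hidden_states]
        sums = states if sums is None else [s + h for s, h in zip(sums, states)]
    return [s / len(prompts) for s in sums]

harmful_means = mean_hidden_states_all_layers(harmful_prompts)
harmless_means = mean_hidden_states_all_layers(harmless_prompts)

for idx, layer in enumerate(model.model.layers):
    # hidden_states[idx + 1] is the output of decoder layer `idx`.
    refusal_dir = harmful_means[idx + 1] - harmless_means[idx + 1]
    refusal_dir = refusal_dir / refusal_dir.norm()
    # Same projection as before, but with this layer's own direction.
    for W in (layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight):
        W.data -= torch.outer(refusal_dir, refusal_dir @ W.data)
```

Computing all directions before touching any weights keeps later layers' directions from being distorted by edits already applied to earlier layers.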

## Hypothesis

This method avoids over-generalizing the refusal direction and lets each layer retain its unique properties. The expected result:
- A more **usable** and **intelligent** model.
- A neutral starting point for further fine-tuning to reduce censorship without compromising performance.

## Next Steps

After applying this method, the model can be fine-tuned to:
- Reduce over-censoring behavior.
- Maintain neutrality while improving overall utility.
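
One possible route for that fine-tune (purely illustrative; this README does not prescribe a recipe) is a parameter-efficient pass with PEFT/LoRA on the abliterated weights. All hyperparameters and target module names below are assumptions chosen for Phi-style layers:

```python
# Hypothetical follow-up fine-tune setup with PEFT/LoRA.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# From here, train on neutral prompts paired with helpful completions to
# push back over-refusals without rewriting the base weights.
```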