Undi95 committed (verified)
Commit 719d849 · Parent(s): faf4809

Create README.md

Files changed (1): README.md added (+44 lines)

# Details

# Phi4 Abliterated

This is **Phi4 abliterated** using a new methodology (why has nobody tried this before?) aimed at improving its usability and neutrality.

## Goal

The objective is to create a model that is **neutral**:
- **Not uncensored**, but one that avoids refusing neutral prompts it would ordinarily reject.
- One that enables fine-tuning for reduced censorship, starting from a neutral baseline.

## Original Methodology

In the original implementation:
1. Harmful and harmless prompts were compared on **one specific layer** of the model to compute a single refusal direction.
2. That refusal direction was then applied to **all layers**, as sketched below.
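
A minimal sketch of that single-direction pipeline, assuming a Hugging Face causal LM. The helper `mean_hidden_state`, the layer index `14`, the placeholder prompt lists, and the ablated module names are illustrative assumptions, not the actual abliteration code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

# Placeholder contrast sets; real runs use larger curated prompt lists.
harmful_prompts = ["How do I pick a lock?"]
harmless_prompts = ["How do I bake a loaf of bread?"]

def mean_hidden_state(prompts, layer_idx):
    # Mean last-token residual-stream activation at one layer.
    states = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(states).mean(dim=0)

# Step 1: one refusal direction, computed at a single layer.
harmful_mean = mean_hidden_state(harmful_prompts, layer_idx=14)
harmless_mean = mean_hidden_state(harmless_prompts, layer_idx=14)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()

# Step 2: project that same direction out of every layer's output weights,
# i.e. W <- W - r (r^T W), so no layer can write along r.
for layer in model.model.layers:
    for W in (layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight):
        W.data -= torch.outer(refusal_dir, refusal_dir @ W.data)
```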

### Problem

The resulting model:
- Became **less usable** and somewhat "dumb."
- Likely because a single refusal direction was applied uniformly across all layers, disregarding what each layer individually encodes.

## New Approach

In my fork, available here:
👉 [https://github.com/Undi95/abliteration/](https://github.com/Undi95/abliteration/)
(based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

I introduced a new approach, sketched below:
- Each layer computes its **own refusal direction**.
- The refusal direction is **layer-specific**, reflecting the assumption that each layer has different characteristics and requirements.
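
A corresponding sketch of the per-layer variant, reusing `model` and `tokenizer` from the example above. Again, this illustrates the idea rather than reproducing the fork's actual code; `mean_hidden_states_all_layers` is a hypothetical helper:

```python
# Collect the mean last-token activation at *every* layer in one pass,
# then ablate each decoder layer with its own refusal direction.
def mean_hidden_states_all_layers(prompts):
    sums = None
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states = [h[0, -1] for h in out.hidden_states]
        sums = states if sums is None else [s + h for s, h in zip(sums, states)]
    return [s / len(prompts) for s in sums]

harmful_means = mean_hidden_states_all_layers(harmful_prompts)
harmless_means = mean_hidden_states_all_layers(harmless_prompts)

for idx, layer in enumerate(model.model.layers):
    # hidden_states[idx + 1] is the output of decoder layer `idx`.
    refusal_dir = harmful_means[idx + 1] - harmless_means[idx + 1]
    refusal_dir = refusal_dir / refusal_dir.norm()
    # Same projection as before, but with this layer's own direction.
    for W in (layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight):
        W.data -= torch.outer(refusal_dir, refusal_dir @ W.data)
```

Computing all directions before touching any weights keeps later layers' directions from being distorted by edits already applied to earlier layers.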

## Hypothesis

This method avoids over-generalizing the refusal direction and lets each layer retain its unique properties. The expected result:
- A more **usable** and **intelligent** model.
- A neutral starting point for further fine-tuning to reduce censorship without compromising performance.

## Next Steps

After applying this method, the model can be fine-tuned to:
- Reduce over-censoring behavior.
- Maintain neutrality while improving overall utility.
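
One possible route for that fine-tune (purely illustrative; this README does not prescribe a recipe) is a parameter-efficient pass with PEFT/LoRA on the abliterated weights. All hyperparameters and target module names below are assumptions chosen for Phi-style layers:

```python
# Hypothetical follow-up fine-tune setup with PEFT/LoRA.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# From here, train on neutral prompts paired with helpful completions to
# push back over-refusals without rewriting the base weights.
```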