Undi95 committed
Commit 460f76a · verified · 1 Parent(s): 719d849

Update README.md

Files changed (1)
  1. README.md +51 -18
README.md CHANGED
@@ -1,25 +1,23 @@
- # Details

- # Phi4 Abliterated
-
- This is **Phi4 abliterated** using a new methodology (why nobody tried that before?) aimed at improving its usability and neutrality.

  ## Goal

  The objective is to create a model that is **neutral**:
  - **Not uncensored**, but avoids refusing neutral prompts it would ordinarily reject.
- - Enables fine-tuning for reduced censorship, starting from a neutral baseline.

  ## Original Methodology

  In the original implementation:
  1. Harmful and harmless prompts were compared on **one specific layer** of the model.
- 2. The computed refusal direction was then applied to **all layers**.

  ### Problem:
- The resulting model:
- - Became **less usable** and somewhat "dumb."
- - Likely due to applying a single refusal direction uniformly across all layers, disregarding their unique needs.

  ## New Approach

@@ -28,17 +26,52 @@ In my fork, available here:
  (based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

  I introduced a new approach:
- - Each layer computes its **own refusal direction**.
- - The refusal direction is **layer-specific**, addressing the assumption that each layer has different characteristics and requirements.

- ## Hypothesis

- This method avoids over-generalizing the refusal direction and allows each layer to retain its unique properties. The result:
- - A more **usable** and **intelligent** model.
- - A neutral starting point for further fine-tuning to reduce censorship without compromising performance.

  ## Next Steps

- After applying this method, the model can be fine-tuned to:
- - Reduce over-censoring behavior.
- - Maintain neutrality while improving overall utility.

+ # Phi4 Abliteration (WIP)

+ This is **Phi4 abliterated** using a new methodology (surprisingly, nobody seems to have tried this before). The approach is still being refined, with a focus on balancing neutrality, usability, and adaptability for fine-tuning.

  ## Goal

  The objective is to create a model that is **neutral**:
  - **Not uncensored**, but avoids refusing neutral prompts it would ordinarily reject.
+ - Provides a foundation for fine-tuning to achieve reduced censorship while maintaining high usability.

  ## Original Methodology

  In the original implementation:
  1. Harmful and harmless prompts were compared on **one specific layer** of the model.
+ 2. The computed refusal direction was then applied **uniformly to all layers** (see the sketch below).
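+
+ To make these two steps concrete, here is a minimal sketch of how such a refusal direction is commonly computed in abliteration-style methods: take the difference of the mean hidden states of "harmful" and "harmless" prompts at the chosen layer, then normalize it. The function and argument names are illustrative assumptions, not the script's actual code:
+ ```python
+ import torch
+
+ def compute_refusal_dir(harmful_hidden: torch.Tensor,
+                         harmless_hidden: torch.Tensor) -> torch.Tensor:
+     """Mean-difference refusal direction from hidden states captured at one
+     layer, each of shape (num_prompts, hidden_size), normalized to unit length."""
+     refusal_dir = harmful_hidden.mean(dim=0) - harmless_hidden.mean(dim=0)
+     return refusal_dir / refusal_dir.norm()
+ ```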
 
  ### Problem:
+ This resulted in:
+ - A model that became **less usable** and **less intelligent** than the original.
+ - This may be because applying a single refusal direction uniformly across all layers disregards the unique role of each layer in the model.

  ## New Approach

  (based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

  I introduced a new approach:
+ - **Each layer computes its own refusal direction.**
+ - The refusal direction is applied specifically to **four key tensors** in each layer.
+
+ ### Four Key Tensors Used (for Phi):
+ For each layer, if a refusal direction exists (`layer_idx in refusal_dirs`), it is applied as follows:
+ ```python
+ if layer_idx in refusal_dirs:
+     refusal_dir = refusal_dirs[layer_idx]
+     lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
+         lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
+         refusal_dir,
+         scale_factor,
+     )
+     lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
+         lm_model.layers[layer_idx].mlp.down_proj.weight.data,
+         refusal_dir,
+         scale_factor,
+     )
+     lm_model.layers[layer_idx].post_attention_layernorm.weight = modify_tensor(
+         lm_model.layers[layer_idx].post_attention_layernorm.weight.data,
+         refusal_dir,
+         scale_factor,
+     )
+     lm_model.layers[layer_idx].input_layernorm.weight = modify_tensor(
+         lm_model.layers[layer_idx].input_layernorm.weight.data,
+         refusal_dir,
+         scale_factor,
+     )
+ ```
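+
+ `modify_tensor` is defined in the fork, so its exact body is not reproduced here. The following is only a rough sketch of what such a helper typically does in abliteration scripts: it subtracts the component of the weight that writes along the refusal direction, scaled by `scale_factor`. The name `modify_tensor_sketch` and the shape assumptions are illustrative, not the fork's actual code:
+ ```python
+ import torch
+
+ def modify_tensor_sketch(weight: torch.Tensor,
+                          refusal_dir: torch.Tensor,
+                          scale_factor: float) -> torch.nn.Parameter:
+     """Remove scale_factor times the component of `weight` along `refusal_dir`.
+     Assumes refusal_dir is unit-norm and that the first dimension of `weight`
+     lives in the hidden (residual-stream) space."""
+     refusal_dir = refusal_dir.to(weight.dtype)
+     if weight.dim() == 1:
+         # LayerNorm-style vectors: subtract the projection onto the direction.
+         projection = torch.dot(refusal_dir, weight) * refusal_dir
+     else:
+         # Projection matrices (o_proj, down_proj): ablate the output rows.
+         projection = torch.outer(refusal_dir, refusal_dir) @ weight
+     return torch.nn.Parameter(weight - scale_factor * projection)
+ ```
+ Whatever the exact implementation, the key point is that the direction being removed here is the layer's own, not a single global one.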

+ ## Why This Change?

+ By applying refusal directions individually to each layer's tensors (a rough per-layer sketch follows below):
+ - The model can retain more **specificity and functionality**.
+ - This avoids over-generalizing the refusal direction across all layers, which previously led to reduced usability.
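+
+ As a rough illustration of the per-layer idea (not the fork's actual code), the same mean-difference computation from the sketch above can simply be repeated on every layer's hidden states, yielding one direction per layer index. The dictionary layout mirrors the `refusal_dirs[layer_idx]` lookup used earlier; the function and argument names are hypothetical:
+ ```python
+ import torch
+
+ def compute_refusal_dirs_per_layer(
+     harmful_hidden_by_layer: dict[int, torch.Tensor],
+     harmless_hidden_by_layer: dict[int, torch.Tensor],
+ ) -> dict[int, torch.Tensor]:
+     """One normalized mean-difference direction per layer index.
+     Inputs map layer_idx -> hidden states of shape (num_prompts, hidden_size)."""
+     refusal_dirs = {}
+     for layer_idx, harmful in harmful_hidden_by_layer.items():
+         harmless = harmless_hidden_by_layer[layer_idx]
+         direction = harmful.mean(dim=0) - harmless.mean(dim=0)
+         refusal_dirs[layer_idx] = direction / direction.norm()
+     return refusal_dirs
+ ```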
+
+ ### Trade-offs:
+ The more we force refusal directions onto the model:
+ - The more **neutral** it becomes, but at the risk of becoming **dumber**.
+ - This underscores the importance of **fine-tuning** after abliterating, to restore functionality and intelligence.
+ - So although the script lets you choose a scale factor, setting it too high will break the model.

  ## Next Steps

+ The abliterated model serves as a **neutral starting point**. Fine-tuning is essential to:
+ - Adjust the model to reduce over-censoring.
+ - Maintain a balance between neutrality and usability.
+
+ This is a **work in progress**; Phi 4 is smoll, so I can toy with it.