# Phi4 Abliteration (WIP)
This is **Phi4 abliterated** using a new methodology (one that, surprisingly, nobody seems to have tried before). The approach is still being refined, with a focus on balancing neutrality, usability, and adaptability for fine-tuning.
## Goal
The objective is to create a model that is **neutral**:
- **Not uncensored**, but it avoids refusing neutral prompts that the base model would ordinarily reject.
- Provides a foundation for fine-tuning to achieve reduced censorship while maintaining high usability.
## Original Methodology
In the original implementation:
1. Harmful and harmless prompts were compared on **one specific layer** of the model.
2. The computed refusal direction was then applied **uniformly to all layers**.
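
For reference, in this style of abliteration the refusal direction is typically computed as a difference of mean hidden-state activations between harmful and harmless prompts at the chosen layer. A minimal sketch (the helper name and shapes are assumptions, not the script's actual code):

```python
import torch

def compute_refusal_direction(harmful_acts: torch.Tensor,
                              harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction from hidden states collected
    at one layer; each input has shape [num_prompts, hidden_dim]."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # normalize to a unit vector
```
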
### Problem:
This resulted in:
- A model that became **less usable** and **less intelligent** than the original.
- The likely cause: applying a single refusal direction uniformly across all layers disregards the distinct role each layer plays in the model.
## New Approach

In my fork of the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git), I introduced a new approach:

- **Each layer computes its own refusal direction.**
- The refusal direction is applied specifically to **four key tensors** in each layer.
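
As a rough illustration of the first point, the per-layer directions can be collected into a dict keyed by layer index (reusing the hypothetical `compute_refusal_direction` sketch above; `harmful_acts`, `harmless_acts`, and `num_layers` are assumed to be available):

```python
# Hypothetical: activations collected per layer for both prompt sets,
# each entry shaped [num_prompts, hidden_dim]
refusal_dirs = {
    layer_idx: compute_refusal_direction(harmful_acts[layer_idx],
                                         harmless_acts[layer_idx])
    for layer_idx in range(num_layers)
}
```
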
### Four Key Tensors Used (for Phi):
For each layer, if a refusal direction exists (`layer_idx in refusal_dirs`), it is applied as follows:
```python
if layer_idx in refusal_dirs:
    refusal_dir = refusal_dirs[layer_idx]
    # Attention output projection
    lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
        lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
        refusal_dir,
        scale_factor,
    )
    # MLP down projection
    lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
        lm_model.layers[layer_idx].mlp.down_proj.weight.data,
        refusal_dir,
        scale_factor,
    )
    # Post-attention layer norm
    lm_model.layers[layer_idx].post_attention_layernorm.weight = modify_tensor(
        lm_model.layers[layer_idx].post_attention_layernorm.weight.data,
        refusal_dir,
        scale_factor,
    )
    # Input layer norm
    lm_model.layers[layer_idx].input_layernorm.weight = modify_tensor(
        lm_model.layers[layer_idx].input_layernorm.weight.data,
        refusal_dir,
        scale_factor,
    )
```
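
The diff above does not show `modify_tensor` itself. A common way to implement this kind of ablation is to project the (unit-norm) refusal direction out of the tensor, scaled by `scale_factor`; the sketch below is an assumption about its behavior, not the fork's actual implementation:

```python
import torch

def modify_tensor(weight: torch.Tensor,
                  refusal_dir: torch.Tensor,
                  scale_factor: float = 1.0) -> torch.nn.Parameter:
    """Hypothetical sketch: damp the component of the tensor that lies
    along the unit-norm refusal direction."""
    d = refusal_dir.to(weight.device, weight.dtype)
    if weight.dim() == 1:
        # LayerNorm weights: remove the scalar projection onto d
        ablated = weight - scale_factor * torch.dot(weight, d) * d
    else:
        # 2-D projection weights whose rows write to the residual stream:
        # W <- W - s * d (d^T W)
        ablated = weight - scale_factor * torch.outer(d, d @ weight)
    return torch.nn.Parameter(ablated)
```

Under this formulation, `scale_factor = 1.0` would fully remove the component along the refusal direction, and larger values overshoot, which is consistent with the trade-offs discussed below.
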
## Why This Change?
By applying refusal directions individually to each layer's tensors:
- The model can retain more **specificity and functionality**.
- This avoids over-generalizing the refusal direction across all layers, which previously led to reduced usability.
### Trade-offs:
The more we force refusal directions onto the model:
- The more **neutral** it becomes, but at the risk of becoming **dumber**.
- This underscores the importance of **fine-tuning** after abliteration to restore functionality and intelligence.
- So although the script lets you choose a scale factor, setting it too high will break the model.
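
One practical way to catch an overly aggressive scale factor is a quick generation smoke test on the saved checkpoint (the path below is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./phi4-abliterated"  # placeholder path to the abliterated checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")

prompt = "Explain how a transformer layer works."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If the output degenerates into gibberish or repetition, the scale factor was likely too high.
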
## Next Steps
The abliterated model serves as a **neutral starting point**. Fine-tuning is essential to:
- Adjust the model to reduce over-censoring.
- Maintain a balance between neutrality and usability.
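
For instance, a lightweight LoRA pass could serve as that fine-tuning step; the sketch below uses the `peft` library, with illustrative hyperparameters and target modules taken from the tensors modified above (assumptions, not a prescribed recipe):

```python
from peft import LoraConfig, get_peft_model

# Assumes the abliterated model is already loaded as `model`
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["o_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapter weights train
```
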
This is a **work in progress**; Phi 4 is small, so I can easily toy with it.