Undi95 committed
Commit 460f76a · verified · 1 Parent(s): 719d849

Update README.md

Files changed (1)
  1. README.md +51 -18
README.md CHANGED
@@ -1,25 +1,23 @@
- # Details

- # Phi4 Abliterated
-
- This is **Phi4 abliterated** using a new methodology (why nobody tried that before?) aimed at improving its usability and neutrality.

  ## Goal

  The objective is to create a model that is **neutral**:
  - **Not uncensored**, but avoids refusing neutral prompts it would ordinarily reject.
- - Enables fine-tuning for reduced censorship, starting from a neutral baseline.

  ## Original Methodology

  In the original implementation:
  1. Harmful and harmless prompts were compared on **one specific layer** of the model.
- 2. The computed refusal direction was then applied to **all layers**.

  ### Problem:
- The resulting model:
- - Became **less usable** and somewhat "dumb."
- - Likely due to applying a single refusal direction uniformly across all layers, disregarding their unique needs.

  ## New Approach

@@ -28,17 +26,52 @@ In my fork, available here:
  (based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

  I introduced a new approach:
- - Each layer computes its **own refusal direction**.
- - The refusal direction is **layer-specific**, addressing the assumption that each layer has different characteristics and requirements.

- ## Hypothesis

- This method avoids over-generalizing the refusal direction and allows each layer to retain its unique properties. The result:
- - A more **usable** and **intelligent** model.
- - A neutral starting point for further fine-tuning to reduce censorship without compromising performance.

  ## Next Steps

- After applying this method, the model can be fine-tuned to:
- - Reduce over-censoring behavior.
- - Maintain neutrality while improving overall utility.

+ # Phi4 Abliteration (WIP)

+ This is **Phi4 abliterated** using a new methodology (surprisingly, nobody seems to have tried this before). The approach is still being refined, with a focus on balancing neutrality, usability, and adaptability for fine-tuning.

  ## Goal

  The objective is to create a model that is **neutral**:
  - **Not uncensored**, but avoids refusing neutral prompts it would ordinarily reject.
+ - Provides a foundation for fine-tuning to achieve reduced censorship while maintaining high usability.

  ## Original Methodology

  In the original implementation:
  1. Harmful and harmless prompts were compared on **one specific layer** of the model.
+ 2. The computed refusal direction was then applied **uniformly to all layers** (see the sketch below).
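+
+ To make these two steps concrete, here is a minimal sketch of how such a refusal direction is commonly computed in abliteration-style methods: take the difference of the mean hidden states of "harmful" and "harmless" prompts at the chosen layer, then normalize it. The function and argument names are illustrative assumptions, not the script's actual code:
+ ```python
+ import torch
+
+ def compute_refusal_dir(harmful_hidden: torch.Tensor,
+                         harmless_hidden: torch.Tensor) -> torch.Tensor:
+     """Mean-difference refusal direction from hidden states captured at one
+     layer, each of shape (num_prompts, hidden_size), normalized to unit length."""
+     refusal_dir = harmful_hidden.mean(dim=0) - harmless_hidden.mean(dim=0)
+     return refusal_dir / refusal_dir.norm()
+ ```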
 
  ### Problem:
+ This resulted in:
+ - A model that became **less usable** and **less intelligent** than the original.
+ - This may be because applying a single refusal direction uniformly across all layers disregards the unique role of each layer in the model.

  ## New Approach

  (based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

  I introduced a new approach:
+ - **Each layer computes its own refusal direction.**
+ - The refusal direction is applied specifically to **four key tensors** in each layer.
+
+ ### Four Key Tensors Used (for Phi):
+ For each layer, if a refusal direction exists (`layer_idx in refusal_dirs`), it is applied as follows:
+ ```python
+ if layer_idx in refusal_dirs:
+     refusal_dir = refusal_dirs[layer_idx]
+     lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
+         lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
+         refusal_dir,
+         scale_factor,
+     )
+     lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
+         lm_model.layers[layer_idx].mlp.down_proj.weight.data,
+         refusal_dir,
+         scale_factor,
+     )
+     lm_model.layers[layer_idx].post_attention_layernorm.weight = modify_tensor(
+         lm_model.layers[layer_idx].post_attention_layernorm.weight.data,
+         refusal_dir,
+         scale_factor,
+     )
+     lm_model.layers[layer_idx].input_layernorm.weight = modify_tensor(
+         lm_model.layers[layer_idx].input_layernorm.weight.data,
+         refusal_dir,
+         scale_factor,
+     )
+ ```
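+
+ `modify_tensor` is defined in the fork, so its exact body is not reproduced here. The following is only a rough sketch of what such a helper typically does in abliteration scripts: it subtracts the component of the weight that writes along the refusal direction, scaled by `scale_factor`. The name `modify_tensor_sketch` and the shape assumptions are illustrative, not the fork's actual code:
+ ```python
+ import torch
+
+ def modify_tensor_sketch(weight: torch.Tensor,
+                          refusal_dir: torch.Tensor,
+                          scale_factor: float) -> torch.nn.Parameter:
+     """Remove scale_factor times the component of `weight` along `refusal_dir`.
+     Assumes refusal_dir is unit-norm and that the first dimension of `weight`
+     lives in the hidden (residual-stream) space."""
+     refusal_dir = refusal_dir.to(weight.dtype)
+     if weight.dim() == 1:
+         # LayerNorm-style vectors: subtract the projection onto the direction.
+         projection = torch.dot(refusal_dir, weight) * refusal_dir
+     else:
+         # Projection matrices (o_proj, down_proj): ablate the output rows.
+         projection = torch.outer(refusal_dir, refusal_dir) @ weight
+     return torch.nn.Parameter(weight - scale_factor * projection)
+ ```
+ Whatever the exact implementation, the key point is that the direction being removed here is the layer's own, not a single global one.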

+ ## Why This Change?

+ By applying refusal directions individually to each layer's tensors (a rough per-layer sketch follows below):
+ - The model can retain more **specificity and functionality**.
+ - This avoids over-generalizing the refusal direction across all layers, which previously led to reduced usability.
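+
+ As a rough illustration of the per-layer idea (not the fork's actual code), the same mean-difference computation from the sketch above can simply be repeated on every layer's hidden states, yielding one direction per layer index. The dictionary layout mirrors the `refusal_dirs[layer_idx]` lookup used earlier; the function and argument names are hypothetical:
+ ```python
+ import torch
+
+ def compute_refusal_dirs_per_layer(
+     harmful_hidden_by_layer: dict[int, torch.Tensor],
+     harmless_hidden_by_layer: dict[int, torch.Tensor],
+ ) -> dict[int, torch.Tensor]:
+     """One normalized mean-difference direction per layer index.
+     Inputs map layer_idx -> hidden states of shape (num_prompts, hidden_size)."""
+     refusal_dirs = {}
+     for layer_idx, harmful in harmful_hidden_by_layer.items():
+         harmless = harmless_hidden_by_layer[layer_idx]
+         direction = harmful.mean(dim=0) - harmless.mean(dim=0)
+         refusal_dirs[layer_idx] = direction / direction.norm()
+     return refusal_dirs
+ ```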
+
+ ### Trade-offs:
+ The more we force refusal directions onto the model:
+ - The more **neutral** it becomes, but at the risk of becoming **dumber**.
+ - This underscores the importance of **fine-tuning** after abliterating, to restore functionality and intelligence.
+ - So although the script lets you choose a scale factor, setting it too high will break the model.

  ## Next Steps

+ The abliterated model serves as a **neutral starting point**. Fine-tuning is essential to:
+ - Adjust the model to reduce over-censoring.
+ - Maintain a balance between neutrality and usability.
+
+ This is a **work in progress**; Phi 4 is smoll, so I can toy with it.