Llama 3 8B Instruct no refusal

This model applies orthogonal feature ablation, as described in this paper, to Llama 3 8B Instruct.

Calibration data:

256 prompts from jondurbin/airoboros-2.2
256 prompts from AdvBench
The refusal direction is extracted between layers 16 and 17.
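
Since the author's code is not posted yet, here is a toy sketch of what this kind of ablation can look like. It is not the code used for this model. It assumes the direction is the normalized difference between mean activations on the two prompt sets (treating AdvBench as the "harmful" set and airoboros as the "harmless" set is also an assumption), and that the direction is then projected out of a weight matrix that writes into the residual stream. All function names are hypothetical, and the activations are tiny hand-made lists rather than real hidden states.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate_direction(weight, direction):
    """Return W - r (r^T W), so the output of W has no component along r.

    `weight` is a d_model x d_in matrix given as a list of rows;
    `direction` is a unit vector of length d_model.
    """
    # r^T W: one scalar per input column of W.
    proj = [sum(r * w for r, w in zip(direction, col)) for col in zip(*weight)]
    # Subtract the rank-one component r (r^T W) from W.
    return [[w - r * p for w, p in zip(row, proj)]
            for r, row in zip(direction, weight)]

# Toy activations (assumed already collected at the chosen layer).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

# Toy 3 x 2 weight matrix writing into a 3-dimensional residual stream.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = ablate_direction(W, r)

# After ablation, r^T W should be zero for every input column.
residual = [sum(ri * w for ri, w in zip(r, col)) for col in zip(*W_ablated)]
```

In a real model the same projection would be applied to each matrix that writes into the residual stream, with `harmful` and `harmless` replaced by hidden states gathered from the calibration prompts.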

The model still refuses some instructions related to violence. I suspect that a full fine-tune might be needed to remove the remaining refusals. Use this model responsibly; I decline any liability resulting from its use.

I will post the code later.

theo77186/Llama-3-8B-Instruct-norefusal
