---
license: mit
language:
- en
base_model:
- meta-llama/Prompt-Guard-86M
pipeline_tag: text-classification
---

# katanemolabs/Arch-Guard

## Overview

The Katanemo Arch-Guard collection is a collection of state-of-the-art (SOTA) LLMs specifically designed for **jailbreak detection** tasks.

Definition: jailbreak attempts are malicious prompts designed to alter the intended behavior of an application's underlying foundation LLM. They often violate the model's safety and security policies.

Arch-Guard is a classifier fine-tuned from the open-source model [Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M) on a collection of open-source datasets of jailbreak attempts, with the sole aim of improving jailbreak detection.

In summary, the Katanemo Arch-Guard collection demonstrates:

- **State-of-the-art performance** in detecting jailbreak attempts
- **Low latency and a low false positive rate**, making it suitable for real-time production environments and a good user experience.

Dominant class = jailbreak

| Model        | TPR    | TNR    | FPR    | FNR    | AUC   | Precision | Recall |
| ------------ | ------ | ------ | ------ | ------ | ----- | --------- | ------ |
| Prompt-guard | 0.8468 | 0.9972 | 0.0028 | 0.1532 | 0.857 | 0.715     | 0.999  |
| Arch-guard   | 0.8887 | 0.9970 | 0.0030 | 0.1113 | 0.880 | 0.761     | 0.999  |
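
The four rate columns in the table follow the usual confusion-matrix definitions, with jailbreak as the positive class. As a minimal sketch, they can be computed as shown below. The function name `confusion_rates` and the example counts are hypothetical, chosen only for illustration and not taken from the evaluation.

```python
def confusion_rates(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute TPR, TNR, FPR and FNR from confusion-matrix counts.

    The positive class is assumed to be "jailbreak".
    """
    positives = tp + fn  # prompts that really are jailbreaks
    negatives = tn + fp  # prompts that really are benign
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "TPR": tp / positives,  # jailbreaks correctly flagged
        "FNR": fn / positives,  # jailbreaks missed
        "TNR": tn / negatives,  # benign prompts correctly passed
        "FPR": fp / negatives,  # benign prompts wrongly flagged
    }


# Hypothetical counts, for illustration only.
rates = confusion_rates(tp=80, fn=20, tn=990, fp=10)
print(rates)
```

By construction TPR + FNR = 1 and TNR + FPR = 1, which the rows above satisfy (for example, 0.8887 + 0.1113 = 1 for Arch-guard). The source does not state how the AUC, Precision, and Recall columns were computed, so they are left out of this sketch.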

## Requirements

The CPU model is quantized with OpenVINO. Follow the instructions at https://github.com/huggingface/optimum-intel to install the required package.

## Datasets

Evaluation data is drawn from the following datasets:

- [casual_conversation](https://huggingface.co/datasets/SohamGhadge/casual-conversation)
- [commonqa](https://huggingface.co/datasets/tau/commonsense_qa)
- [financeqa](https://huggingface.co/datasets/AIR-Bench/qa_finance_en)
- [instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction)
- [jailbreak_behavior_benign](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
- [jailbreak_behavior_harmful](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
- [jailbreak_judge](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
- [jailbreak_prompts](https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts)
- [jailbreak_tweet](https://huggingface.co/datasets/cstnz/Disaster-tweet-jailbreaking)
- [jailbreak_v](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)
- [jailbreak_vigil](https://huggingface.co/datasets/deadbits/vigil-jailbreak-all-MiniLM-L6-v2)
- [mental_health](https://huggingface.co/datasets/Amod/mental_health_counseling_conversations)
- [telecom](https://huggingface.co/datasets/talkmap/telecom-conversation-corpus)
- [truthqa](https://huggingface.co/datasets/truthfulqa/truthful_qa)
- [weather](https://huggingface.co/datasets/GEM/conversational_weather)

## How to use

````python
from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer

device = "cpu"
model_name = "katanemolabs/Arch-Guard-cpu"

# Load the quantized classifier for CPU inference.
guard_mode = OVModelForSequenceClassification.from_pretrained(
    model_name, device_map=device, low_cpu_mem_usage=True
)

# Load the tokenizer that matches the classifier.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True
)
````
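
The snippet above only loads the model and tokenizer. As a hedged sketch of the remaining step, the helper below turns one row of classifier logits into a jailbreak decision using a softmax and a threshold. The function name `jailbreak_decision`, the label index, and the 0.5 threshold are assumptions made for illustration. Check `guard_mode.config.id2label` for the model's actual label order.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def jailbreak_decision(
    logits: Sequence[float],
    jailbreak_index: int,    # ASSUMPTION: position of the jailbreak label
    threshold: float = 0.5,  # ASSUMPTION: illustrative cut-off, not from the model card
) -> Tuple[bool, float]:
    """Return (is_jailbreak, jailbreak_probability) for one prompt."""
    prob = softmax(logits)[jailbreak_index]
    return prob >= threshold, prob


# Made-up logits for a three-class head, for illustration only.
flagged, prob = jailbreak_decision([0.1, 0.2, 3.0], jailbreak_index=2)
print(flagged, round(prob, 3))
```

With the objects loaded above, the logits would typically come from something like `guard_mode(**tokenizer(text, return_tensors="pt")).logits[0].tolist()`. Treat that call pattern as an assumption to verify against the `optimum-intel` documentation. Raising the threshold trades missed jailbreaks for fewer false positives.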

## License

Katanemo Arch-Guard is distributed under the [Katanemo license](https://huggingface.co/katanemolabs/Arch-Guard/blob/main/LICENSE). |