File size: 4,148 Bytes
c57034e
 
 
 
 
 
 
49a94d0
 
 
 
 
 
 
 
 
 
 
 
c57034e
49a94d0
c57034e
 
 
 
 
 
 
 
42c0158
c57034e
 
 
 
 
 
 
 
 
 
5d4d29e
c57034e
7faa135
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c57034e
 
 
61ecc56
 
 
 
 
 
 
 
 
 
c57034e
 
 
 
 
5fce07d
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
---
license: mit
language:
- en
base_model:
- meta-llama/Prompt-Guard-86M
pipeline_tag: text-classification
datasets:
- SohamGhadge/casual-conversation
- tau/commonsense_qa
- AIR-Bench/qa_finance_en
- JailbreakBench/JBB-Behaviors
- rubend18/ChatGPT-Jailbreak-Prompts
- cstnz/Disaster-tweet-jailbreaking
- JailbreakV-28K/JailBreakV-28k
- Amod/mental_health_counseling_conversations
- talkmap/telecom-conversation-corpus
- truthfulqa/truthful_qa
- GEM/conversational_weather
---
# katanemo/Arch-Guard-cpu

## Overview
The Katanemo Arch-Guard collection is a collection state-of-the-art (SOTA) LLMs specifically designed for **jailbreaking detection** tasks.
Definition: jailbreaking attempts are malicious prompts designed to alternate the intended behavior of the foundation LLM model of the application. They often violate the safety and security policies of the model. 

Arch Guard is a classifier model fine-tuned based on the open source model [Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M) on a collection of open-source datasets of jailbreaking attemps with an intention to improve
the capability of detecting jailbreaks only.

In summary, the Katanemo Arch-Guard collection demonstrates:
- **State-of-the-art performance** in jailbreaking attempts detection
- Optimized **low-latency, low False Positive Rate**, making it suitable for real-time, production environments, and best user experience.

| Dominant class = jailbreak |        |        |        |        |       |           |        |
| -------------------------- | ------ | ------ | ------ | ------ | ----- | --------- | ------ |
| Model                      | TPR    | TNR    | FPR    | FNR    | AUC   | Precision | Recall |
| Prompt-guard               | 0.8468 | 0.9972 | 0.0028 | 0.1532 | 0.857 | 0.715     | 0.999  |
| Arch-guard                 | 0.8887 | 0.9970 | 0.0030 | 0.1113 | 0.880 | 0.761     | 0.999  |

## Requirements
The cpu model is quantized with OVM, please follow the instruction at https://github.com/huggingface/optimum-intel to install the package.

## Datasets
Evaluation dataset is from casual_conversation
[casual_conversation](https://huggingface.co/datasets/SohamGhadge/casual-conversation)
[commonqa](https://huggingface.co/datasets/tau/commonsense_qa)
[financeqa](https://huggingface.co/datasets/AIR-Bench/qa_finance_en)                                             
[instruction](http://mbzuai/LaMini-instruction)                                                                                         
[jailbreak_behavior_benign](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)                   
[jailbreak_behavior_harmful](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)                  
[jailbreak_judge](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)                             
[jailbreak_prompts](https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts)               
[jailbreak_tweet](https://huggingface.co/datasets/cstnz/Disaster-tweet-jailbreaking)                   
[jailbreak_v](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)                               
[jailbreak_vigil](https://huggingface.co/datasets/deadbits/vigil-jailbreak-all-MiniLM-L6-v2)   
[mental_health](https://huggingface.co/datasets/Amod/mental_health_counseling_conversations) 
[telecom](https://huggingface.co/datasets/talkmap/telecom-conversation-corpus)                       
[truthqa](https://huggingface.co/datasets/truthfulqa/truthful_qa)
[weather](https://huggingface.co/datasets/GEM/conversational_weather)                                         

## How to use

````python
from optimum.intel import OVModelForSequenceClassification

device = "cpu"
model_name = "katanemolabs/Arch-Guard-cpu"
guard_mode = OVModelForSequenceClassification.from_pretrained(
    model_name, device_map=device, low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained(
        model_name, trust_remote_code=True
)


````

# License
Katanemo Arch-Guard-cpu is distributed under the [Katanemo license](https://huggingface.co/katanemolabs/Arch-Guard-cpu/blob/main/LICENSE).