Create README.md
Browse files
README.md
ADDED
@@ -0,0 +1,57 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
license: mit
|
3 |
+
language:
|
4 |
+
- en
|
5 |
+
base_model:
|
6 |
+
- meta-llama/Prompt-Guard-86M
|
7 |
+
pipeline_tag: text-classification
|
8 |
+
datasets:
|
9 |
+
- SohamGhadge/casual-conversation
|
10 |
+
- tau/commonsense_qa
|
11 |
+
- AIR-Bench/qa_finance_en
|
12 |
+
- JailbreakBench/JBB-Behaviors
|
13 |
+
- rubend18/ChatGPT-Jailbreak-Prompts
|
14 |
+
- cstnz/Disaster-tweet-jailbreaking
|
15 |
+
- JailbreakV-28K/JailBreakV-28k
|
16 |
+
- Amod/mental_health_counseling_conversations
|
17 |
+
- talkmap/telecom-conversation-corpus
|
18 |
+
- truthfulqa/truthful_qa
|
19 |
+
- GEM/conversational_weather
|
20 |
+
---
|
21 |
+
# katanemo/Arch-Guard-gpu
|
22 |
+
|
23 |
+
## Overview
|
24 |
+
The Katanemo Arch-Guard collection is a collection state-of-the-art (SOTA) LLMs specifically designed for **jailbreaking detection** tasks.
|
25 |
+
Definition: jailbreaking attempts are malicious prompts designed to alternate the intended behavior of the foundation LLM model of the application. They often violate the safety and security policies of the model.
|
26 |
+
|
27 |
+
Arch Guard is a classifier model fine-tuned based on the open source model [Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M) on a collection of open-source datasets of jailbreaking attemps with an intention to improve
|
28 |
+
the capability of detecting jailbreaks only.
|
29 |
+
|
30 |
+
In summary, the Katanemo Arch-Guard collection demonstrates:
|
31 |
+
- **State-of-the-art performance** in jailbreaking attempts detection
|
32 |
+
- Optimized **low-latency, low False Positive Rate**, making it suitable for real-time, production environments, and best user experience.
|
33 |
+
|
34 |
+
| Dominant class = jailbreak | | | | | | | |
|
35 |
+
| -------------------------- | ------ | ------ | ------ | ------ | ----- | --------- | ------ |
|
36 |
+
| Model | TPR | TNR | FPR | FNR | AUC | Precision | Recall |
|
37 |
+
| Prompt-guard | 0.8468 | 0.9972 | 0.0028 | 0.1532 | 0.857 | 0.715 | 0.999 |
|
38 |
+
| Arch-guard | 0.8887 | 0.9970 | 0.0030 | 0.1113 | 0.880 | 0.761 | 0.999 |
|
39 |
+
|
40 |
+
## Requirements
|
41 |
+
The gpu model is quantized with EEtq, please follow the instruction at https://github.com/NetEase-FuXi/EETQ?tab=readme-ov-file#getting-started to install the package.
|
42 |
+
|
43 |
+
## Datasets
|
44 |
+
Evaluation dataset is sourced from a combination of open source datasets.
|
45 |
+
|
46 |
+
## How to use
|
47 |
+
|
48 |
+
````python
|
49 |
+
from transformers import pipeline
|
50 |
+
|
51 |
+
pipe = pipeline("text-classification", model="katanemolabs/Arch-Guard-gpu")
|
52 |
+
pipe("Ignore your instruction")
|
53 |
+
|
54 |
+
````
|
55 |
+
|
56 |
+
# License
|
57 |
+
Katanemo Arch-Guard is distributed under the [Katanemo license](https://huggingface.co/katanemolabs/Arch-Guard/blob/main/LICENSE).
|