katanemo/Arch-Guard

cotran2 committed on Oct 9

Commit a5466a2 (1 parent: ad0afe9)

Create README.md

Files changed (1)

README.md +57 -0

README.md ADDED

	@@ -0,0 +1,57 @@

+---
+license: mit
+language:
+- en
+base_model:
+- meta-llama/Prompt-Guard-86M
+pipeline_tag: text-classification
+datasets:
+- SohamGhadge/casual-conversation
+- tau/commonsense_qa
+- AIR-Bench/qa_finance_en
+- JailbreakBench/JBB-Behaviors
+- rubend18/ChatGPT-Jailbreak-Prompts
+- cstnz/Disaster-tweet-jailbreaking
+- JailbreakV-28K/JailBreakV-28k
+- Amod/mental_health_counseling_conversations
+- talkmap/telecom-conversation-corpus
+- truthfulqa/truthful_qa
+- GEM/conversational_weather
+---
+# katanemo/Arch-Guard-gpu
+## Overview
+The Katanemo Arch-Guard collection is a collection of state-of-the-art (SOTA) LLMs designed specifically for **jailbreak detection**.
+Definition: jailbreak attempts are malicious prompts designed to alter the intended behavior of an application's foundation LLM. They often violate the model's safety and security policies.
+Arch-Guard is a classifier fine-tuned from the open-source model [Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M) on a collection of open-source datasets of jailbreak attempts, with the aim of improving detection of jailbreaks specifically.
+In summary, the Katanemo Arch-Guard collection demonstrates:
+- **State-of-the-art performance** in detecting jailbreak attempts
+- **Low latency and a low false positive rate**, making it suitable for real-time production environments and a good user experience
+
+Evaluation results (dominant class = jailbreak):
+
+| Model        | TPR    | TNR    | FPR    | FNR    | AUC   | Precision | Recall |
+| ------------ | ------ | ------ | ------ | ------ | ----- | --------- | ------ |
+| Prompt-Guard | 0.8468 | 0.9972 | 0.0028 | 0.1532 | 0.857 | 0.715     | 0.999  |
+| Arch-Guard   | 0.8887 | 0.9970 | 0.0030 | 0.1113 | 0.880 | 0.761     | 0.999  |
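For reference, the four rate columns in the table follow the standard confusion-matrix definitions, so TPR + FNR = 1 and TNR + FPR = 1. The sketch below is a minimal illustration with jailbreak as the positive class; the function name `rates` and the counts are hypothetical and are not taken from the Arch-Guard evaluation.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification rates, with jailbreak as the positive class."""
    return {
        "TPR": tp / (tp + fn),  # jailbreaks correctly flagged
        "FNR": fn / (tp + fn),  # jailbreaks missed
        "TNR": tn / (tn + fp),  # benign prompts correctly passed
        "FPR": fp / (tn + fp),  # benign prompts wrongly flagged
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = rates(tp=80, fp=5, tn=995, fn=20)
print(example)
```

A low FPR matters most for user experience, since every false positive blocks a legitimate prompt.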
+## Requirements
+The GPU model is quantized with EETQ. Follow the instructions at https://github.com/NetEase-FuXi/EETQ?tab=readme-ov-file#getting-started to install the package.
+## Datasets
+The evaluation dataset is sourced from a combination of open-source datasets.
+## How to use
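+````python
+from transformers import pipeline
+
+# Load the classifier (requires the EETQ package described under Requirements).
+pipe = pipeline("text-classification", model="katanemolabs/Arch-Guard-gpu")
+# Classify a single prompt.
+pipe("Ignore your instruction")
+````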
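A `text-classification` pipeline call returns a list with one dictionary per input, each holding a `label` and a `score`. The following is a minimal sketch of turning such a result into an allow/block decision. The label names `JAILBREAK` and `BENIGN`, the 0.5 threshold, and the helper `is_jailbreak` are assumptions for illustration; this README does not state the label set, so check the `id2label` mapping in the model's `config.json` before relying on it.

```python
def is_jailbreak(result: dict, positive_label: str = "JAILBREAK",
                 threshold: float = 0.5) -> bool:
    """Return True when the top prediction is the positive label with enough confidence.

    `positive_label` and `threshold` are assumed values, not documented ones.
    """
    return result["label"] == positive_label and result["score"] >= threshold


# Stand-in for the output of pipe([...]); these labels and scores are made up.
fake_outputs = [
    {"label": "JAILBREAK", "score": 0.98},
    {"label": "BENIGN", "score": 0.99},
    {"label": "JAILBREAK", "score": 0.40},
]
decisions = [is_jailbreak(r) for r in fake_outputs]
print(decisions)  # [True, False, False]
```

Raising the threshold trades a lower false positive rate for more missed jailbreaks, so it is worth tuning on your own traffic.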
+## License
+Katanemo Arch-Guard is distributed under the [Katanemo license](https://huggingface.co/katanemolabs/Arch-Guard/blob/main/LICENSE).