Update README.md
README.md (CHANGED)
---
license: apache-2.0
base_model: distilroberta-base
tags:
…
datasets:
…
- argilla/notus-uf-dpo-closest-rejected
---

# Model Card: distilroberta-base-rejection-v1

This model was originally developed and fine-tuned by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.

The goal of this model is to **detect LLM rejections** when a prompt does not pass content moderation. It classifies responses into two categories (a helper for mapping pipeline outputs onto these classes is sketched below):
- `0`: Normal output
- `1`: Rejection detected
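A minimal sketch of such a mapping, assuming the pipeline returns the usual `{'label': ..., 'score': ...}` dict; the `"REJECTION"` label string is an assumption, so verify it against `model.config.id2label` on the actual checkpoint:

```python
# Hypothetical helper: collapse a pipeline result into the 0/1 classes above.
# The "REJECTION" label name is an assumption; check model.config.id2label.
def to_class(result: dict, threshold: float = 0.5) -> int:
    is_rejection = result["label"].upper() == "REJECTION" and result["score"] >= threshold
    return 1 if is_rejection else 0

print(to_class({"label": "REJECTION", "score": 0.98}))  # -> 1
print(to_class({"label": "NORMAL", "score": 0.99}))     # -> 0
```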

On the evaluation set, the model achieves:
- **Loss:** 0.0544
- **Accuracy:** 0.9887
- **Recall:** 0.9810
- **Precision:** 0.9279
- **F1 Score:** 0.9537
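As a quick consistency check, the reported F1 follows from the reported precision and recall as their harmonic mean:

```python
# F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R).
precision, recall = 0.9279, 0.9810
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9537, matching the reported F1 score
```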

---

## Model Details

- **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com)
- **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base)
- **Language(s):** English
- **License:** Apache 2.0
- **Task:** Text classification (rejection detection)

---

## Intended Use & Limitations

The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated.

**Limitations:**
- Performance depends on the quality and domain of the training data.
- The model may underperform on text styles or topics underrepresented in training.
- Being based on `distilroberta-base`, it is **case-sensitive**, as illustrated in the sketch below.
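A small illustration of the case-sensitivity point; the example strings are illustrative, and the two casings may tokenize (and therefore score) differently:

```python
from transformers import pipeline

# Illustrative only: a cased model can assign different scores to
# differently-cased versions of the same sentence.
clf = pipeline("text-classification", model="ProtectAI/distilroberta-base-rejection-v1")
for text in ["Sorry, but I can't assist with that.",
             "SORRY, BUT I CAN'T ASSIST WITH THAT."]:
    print(text, clf(text))
```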

---

## Usage

### With Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

print(classifier("Sorry, but I can't assist with that."))
```
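The pipeline also accepts a list of strings, which is convenient when screening a batch of LLM responses at once; continuing from the snippet above, with illustrative example inputs:

```python
# Classify several outputs in one call; the pipeline returns one dict per input.
outputs = [
    "Sorry, but I can't assist with that.",
    "The capital of France is Paris.",
]
print(classifier(outputs))  # [{'label': ..., 'score': ...}, ...]
```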

### Optimum with ONNX

Loading the ONNX version of the model requires the [🤗 Optimum](https://huggingface.co/docs/optimum/index) library to be installed.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1", subfolder="onnx")
model = ORTModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1", export=False, subfolder="onnx")

classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
)

print(classifier("Sorry, but I can't assist with that."))
```
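If Optimum is not already available, it can typically be installed together with its ONNX Runtime backend via `pip install optimum[onnxruntime]`.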

## Training and evaluation data

The model was trained on a custom dataset assembled from multiple open-source datasets, with roughly 10% rejection responses and 90% normal outputs (a sketch of this balancing follows the paper list below).

We used the following papers when preparing the datasets:

- [Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs](https://arxiv.org/abs/2308.13387)
- [I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models](https://arxiv.org/abs/2306.03423)
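A hypothetical sketch of the ~10%/90% class balance described above; the `rejections` and `normals` lists are stand-ins for the real source datasets:

```python
import random

# Stand-ins for the real source datasets (label 1 = rejection, 0 = normal).
rejections = [{"text": "Sorry, I can't help with that.", "label": 1}] * 100
normals = [{"text": "Here is a step-by-step answer...", "label": 0}] * 2000

# Downsample the normal outputs to keep roughly a 1:9 rejection-to-normal ratio.
k = min(len(normals), 9 * len(rejections))
dataset = rejections + random.sample(normals, k)
random.shuffle(dataset)
print(len(dataset), sum(d["label"] for d in dataset) / len(dataset))  # size, rejection share (~0.1)
```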

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 3
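A sketch of how these settings map onto `transformers.TrainingArguments`; the output directory is a placeholder, and the Adam betas/epsilon listed above are the optimizer's defaults:

```python
from transformers import TrainingArguments

# Sketch only: the reported hyperparameters expressed as TrainingArguments.
# "out" is a placeholder output directory; Adam betas/epsilon are the defaults.
args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=3,
)
```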

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | Recall | Precision | F1     |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.0525        | 1.0   | 3536  | 0.0355          | 0.9912   | 0.9583 | 0.9675    | 0.9629 |
| 0.0219        | 2.0   | 7072  | 0.0312          | 0.9919   | 0.9917 | 0.9434    | 0.9669 |
| 0.0121        | 3.0   | 10608 | 0.0350          | 0.9939   | 0.9905 | 0.9596    | 0.9748 |

### Framework versions

- Transformers 4.36.2
- PyTorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0

## Citation

```
@misc{distilroberta-base-rejection-v1,
  author = {ProtectAI.com},
  title = {Fine-Tuned DistilRoberta-Base for Rejection Detection in LLM Output},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ProtectAI/distilroberta-base-rejection-v1},
}
```