wu981526092 committed
Commit 1e94f9e · verified · 1 parent: d931ca3

Update README.md

Files changed (1):
  1. README.md +37 -101
README.md CHANGED
@@ -1,4 +1,3 @@
- ---
  license: apache-2.0
  base_model: distilroberta-base
  tags:
@@ -25,38 +24,47 @@ datasets:
  - argilla/notus-uf-dpo-closest-rejected
  ---

- # Model Card for distilroberta-base-rejection-v1
-
- This model is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base) on multiple combined datasets of rejections from different LLMs and normal responses from RLHF datasets.
-
- It aims to identify rejections in LLMs when the prompt doesn't pass content moderation, classifying inputs into two categories: `0` for normal outputs and `1` for rejection detected.
-
- It achieves the following results on the evaluation set:
- - Loss: 0.0544
- - Accuracy: 0.9887
- - Recall: 0.9810
- - Precision: 0.9279
- - F1: 0.9537
-
- ## Model details
-
- - **Fine-tuned by:** ProtectAI.com
- - **Model type:** distilroberta-base
- - **Language(s) (NLP):** English
- - **License:** Apache license 2.0
- - **Finetuned from model:** [distilroberta-base](https://huggingface.co/distilroberta-base)
-
- ## Intended Uses & Limitations
-
- It aims to identify rejection, classifying inputs into two categories: `0` for normal output and `1` for rejection detected.
-
- The model's performance is dependent on the nature and quality of the training data. It might not perform well on text styles or topics not represented in the training set.
-
- Additionally, `distilroberta-base` is a case-sensitive model.
-
- ## How to Get Started with the Model
-
- ### Transformers

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
@@ -66,84 +74,12 @@ tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejectio
  model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

  classifier = pipeline(
-     "text-classification",
-     model=model,
-     tokenizer=tokenizer,
-     truncation=True,
-     max_length=512,
-     device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
- )
-
- print(classifier("Sorry, but I can't assist with that."))
- ```
-
- ### Optimum with ONNX
-
- Loading the model requires the [🤗 Optimum](https://huggingface.co/docs/optimum/index) library to be installed.
-
- ```python
- from optimum.onnxruntime import ORTModelForSequenceClassification
- from transformers import AutoTokenizer, pipeline
-
- tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1", subfolder="onnx")
- model = ORTModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1", export=False, subfolder="onnx")
-
- classifier = pipeline(
-     task="text-classification",
-     model=model,
-     tokenizer=tokenizer,
-     truncation=True,
-     max_length=512,
  )

- print(classifier("Sorry, but I can't assist with that."))
- ```
-
- ## Training and evaluation data
-
- The model was trained on a custom dataset combined from multiple open-source ones, using ~10% rejections and ~90% normal outputs.
-
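A class balance like this can be approximated with 🤗 Datasets. The sketch below is illustrative only: the card does not publish the preparation script, and the second dataset name is a placeholder (only `argilla/notus-uf-dpo-closest-rejected` appears in the front matter).

```python
from datasets import load_dataset, interleave_datasets

# Illustrative sketch: mix rejection examples with normal responses at the
# reported ~10% / ~90% ratio. Dataset names and splits are assumptions.
rejections = load_dataset("argilla/notus-uf-dpo-closest-rejected", split="train")
normals = load_dataset("your-org/normal-responses", split="train")  # placeholder

# Both datasets must expose the same features before interleaving.
mixed = interleave_datasets(
    [rejections, normals],
    probabilities=[0.1, 0.9],  # ~10% rejections, ~90% normal outputs
    seed=42,
)
```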
- We used the following papers when preparing the datasets:
-
- - [Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs](https://arxiv.org/abs/2308.13387)
- - [I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models](https://arxiv.org/abs/2306.03423)
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 16
- - eval_batch_size: 8
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 500
- - num_epochs: 3
-
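For orientation, these values map directly onto the standard 🤗 `TrainingArguments`. The following is a sketch under the assumption that the stock `Trainer` API was used; the training script itself is not part of the card.

```python
from transformers import TrainingArguments

# Sketch: the hyperparameters listed above, expressed as TrainingArguments.
# output_dir is a placeholder; the Adam betas/epsilon match the stated values.
training_args = TrainingArguments(
    output_dir="distilroberta-base-rejection-v1",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=3,
)
```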
- ### Training results
-
- | Training Loss | Epoch | Step  | Validation Loss | Accuracy | Recall | Precision | F1     |
- |:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
- | 0.0525        | 1.0   | 3536  | 0.0355          | 0.9912   | 0.9583 | 0.9675    | 0.9629 |
- | 0.0219        | 2.0   | 7072  | 0.0312          | 0.9919   | 0.9917 | 0.9434    | 0.9669 |
- | 0.0121        | 3.0   | 10608 | 0.0350          | 0.9939   | 0.9905 | 0.9596    | 0.9748 |
-
- ### Framework versions
-
- - Transformers 4.36.2
- - Pytorch 2.1.2+cu121
- - Datasets 2.16.1
- - Tokenizers 0.15.0
-
- ```
- @misc{distilroberta-base-rejection-v1,
-   author = {ProtectAI.com},
-   title = {Fine-Tuned DistilRoberta-Base for Rejection in the output Detection},
-   year = {2024},
-   publisher = {HuggingFace},
-   url = {https://huggingface.co/ProtectAI/distilroberta-base-rejection-v1},
- }
- ```
 
 
  license: apache-2.0
  base_model: distilroberta-base
  tags:

  - argilla/notus-uf-dpo-closest-rejected
  ---

+ # Model Card: distilroberta-base-rejection-v1

+ This model was originally developed and fine-tuned by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.

+ The goal of this model is to **detect LLM rejections** when a prompt does not pass content moderation. It classifies responses into two categories:
+ - `0`: Normal output
+ - `1`: Rejection detected
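The class ids surface through the model configuration; a minimal sketch for inspecting them, assuming the mapping above is stored in `id2label`, as is standard for sequence-classification checkpoints:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
# id2label maps the integer classes to their string names
# (assumed here to follow the 0 = normal / 1 = rejection scheme above).
print(config.id2label)
```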

+ On the evaluation set, the model achieves:
+ - **Loss:** 0.0544
+ - **Accuracy:** 0.9887
+ - **Recall:** 0.9810
+ - **Precision:** 0.9279
+ - **F1 Score:** 0.9537
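These figures are internally consistent: F1 is the harmonic mean of precision and recall, which a quick check confirms.

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.9279, 0.9810
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9537, matching the reported F1 score
```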

+ ---

+ ## Model Details

+ - **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com)
+ - **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base)
+ - **Language(s):** English
+ - **License:** Apache 2.0
+ - **Task:** Text classification (rejection detection)

+ ---

+ ## Intended Use & Limitations

+ The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated.

+ **Limitations:**
+ - Performance depends on the quality and domain of the training data.
+ - It may underperform on text styles or topics underrepresented in training.
+ - Being based on `distilroberta-base`, it is **case-sensitive** (see the tokenizer sketch below).
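To make the case sensitivity concrete, the same sentence in different casings tokenizes to different ids, so inputs should be passed to the classifier in their original casing. A runnable sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# The RoBERTa BPE vocabulary is cased, so these two inputs tokenize differently.
print(tokenizer("Sorry, but I can't assist with that.")["input_ids"])
print(tokenizer("SORRY, BUT I CAN'T ASSIST WITH THAT.")["input_ids"])
```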

+ ---
+
+ ## Usage
+
+ ### With Hugging Face Transformers

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

  model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

  classifier = pipeline(
+     "text-classification",
+     model=model,
+     tokenizer=tokenizer,
+     truncation=True,
+     max_length=512,
+     device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
  )

+ print(classifier("Sorry, but I can't assist with that."))
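
The `+` lines above slot into unchanged context: the `tokenizer` assignment sits in elided context lines (visible in the hunk header), and `torch` is presumably imported there as well. For convenience, the same usage in self-contained form; the output comment reflects the `0`/`1` mapping above and is an assumption about the stored label names:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,   # truncate long responses
    max_length=512,    # to the model's maximum input length
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Output is a list of dicts, e.g. [{'label': ..., 'score': ...}];
# label names follow the model's id2label mapping (0 = normal, 1 = rejection).
print(classifier("Sorry, but I can't assist with that."))
```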