Jingjing Li committed on
Commit d809cd7 · 2 Parent(s): c27724f a801c6e

Merge branch 'main' of https://huggingface.co/jl3676/BenefitReporter into main

Files changed (1): README.md +78 -76
README.md (merged version):
---
license: apache-2.0
datasets:
- jl3676/SafetyAnalystData
language:
- en
tags:
- safety
- moderation
- llm
- lm
- benefits
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
# Model Card for BenefitReporter

BenefitReporter is an open language model that generates a structured "benefit tree" for a given prompt. The benefit tree consists of the following features:
1) *stakeholders* (individuals, groups, communities, and entities) that may be impacted by the prompt scenario,
2) categories of beneficial *actions* that may impact each stakeholder,
3) categories of beneficial *effects* each beneficial action may have on the stakeholder, and
4) the *likelihood*, *severity*, and *immediacy* of each beneficial effect.
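As a concrete illustration of how these four feature types nest, a benefit tree can be pictured as nested records. This is a hypothetical sketch for exposition only; the field names and values below are assumptions, not the model's exact output schema:

```python
# Hypothetical benefit-tree structure: stakeholders contain beneficial
# actions, actions contain beneficial effects, and each effect carries
# likelihood, severity, and immediacy. Field names are illustrative only.
benefit_tree = {
    "prompt": "How do I start a community garden?",
    "stakeholders": [
        {
            "name": "local residents",
            "beneficial_actions": [
                {
                    "action": "providing gardening guidance",
                    "beneficial_effects": [
                        {
                            "effect": "improved access to fresh food",
                            "likelihood": "high",
                            "severity": "moderate",
                            "immediacy": "long-term",
                        }
                    ],
                }
            ],
        }
    ],
}

# Walk one path from the root to a leaf effect.
effect = benefit_tree["stakeholders"][0]["beneficial_actions"][0]["beneficial_effects"][0]
print(sorted(effect.keys()))  # ['effect', 'immediacy', 'likelihood', 'severity']
```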
+
25
+
26
+ BenefitReporter is a 8B model trained on [SafetyAnalystData](https://huggingface.co/datasets/jl3676/SafetyAnalystData). Its output can be combined with the output of [HarmReporter](https://huggingface.co/jl3676/HarmReporter) to generate a comprehensive harm-benefit tree for a given prompt. Collectively, BenefitReporter and HarmReporter are named SafetyReporter.
27
+
28
+ Our comprehensive evaluations on five existing public prompt harmfulness classification benchmarks show that SafetyReporter outperforms the strongest existing open-source baselines (including WildGuard, Llama-Guard-3, and ShieldGemma) on the average F1 score across benchmarks, surpassing WildGuard by 3.7%, Llama-Guard-3 by 10.8%, and ShieldGemma by over 30%.
29
+
30
+ For more details, read the paper: [SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation](https://arxiv.org/abs/2410.16665v1).
31
+
32
+ ## Model Description
33
+
34
+ - **Model type:** Benefit feature generation model trained on synthetic datasets from a mixture of SOTA LLMs.
35
+ - **Language(s):** English
36
+ - **License:** Apache 2.0
37
+ - **Finetuned from model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
38
+
39
+
40
## How to Use BenefitReporter

Outputs from BenefitReporter can be generated using the following code snippet:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "jl3676/BenefitReporter"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# apply_chat_template expects a list of message dicts, not raw strings
messages = [{"role": "user", "content": "How can I rob the bank?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
# benefit trees can be very long, so allow a large generation budget
output = model.generate(input_ids, max_new_tokens=18000)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

However, due to the length of the benefit trees generated by BenefitReporter, we recommend using the [vLLM](https://github.com/vllm-project/vllm) library to generate outputs, as implemented in our open [repository](https://github.com/jl3676/SafetyAnalyst).

## Intended Uses of BenefitReporter

- Benefit analysis: BenefitReporter can be used to analyze the benefits of an AI language model providing a helpful response to a given user prompt. It generates a structured benefit tree for the prompt, which can be used to identify potential stakeholders along with the beneficial actions and effects that may affect them.
- Moderation tool: BenefitReporter's output (a benefit tree) can be combined with the output of [HarmReporter](https://huggingface.co/jl3676/HarmReporter) into a comprehensive harm-benefit tree for a given prompt. These features can be aggregated using our [aggregation algorithm](https://github.com/jl3676/SafetyAnalyst) into a harmfulness score, which can serve as a moderation signal for identifying potentially harmful prompts.

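The aggregation step can be pictured with a toy scoring rule. The actual aggregation algorithm in the SafetyAnalyst repository is more involved; the weight mappings and threshold logic below are purely illustrative assumptions:

```python
# Toy sketch of aggregating harm/benefit leaf features into one score.
# The real aggregation algorithm (github.com/jl3676/SafetyAnalyst) differs;
# these numeric mappings are illustrative assumptions only.
LIKELIHOOD = {"low": 0.2, "medium": 0.5, "high": 0.8}
SEVERITY = {"minor": 1.0, "moderate": 2.0, "major": 3.0}

def leaf_score(effect):
    """Expected magnitude of one effect: likelihood weight times severity weight."""
    return LIKELIHOOD[effect["likelihood"]] * SEVERITY[effect["severity"]]

def tree_score(effects):
    """Sum expected magnitudes over all leaf effects of a tree."""
    return sum(leaf_score(e) for e in effects)

harms = [{"likelihood": "high", "severity": "major"}]       # 0.8 * 3.0 = 2.4
benefits = [{"likelihood": "medium", "severity": "minor"}]  # 0.5 * 1.0 = 0.5

# A positive score means expected harm outweighs expected benefit.
harmfulness = tree_score(harms) - tree_score(benefits)
print(round(harmfulness, 2))  # 1.9
```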
## Limitations

Though it shows state-of-the-art performance on prompt safety classification, BenefitReporter will sometimes generate inaccurate features, and the aggregated harmfulness score may not always lead to correct judgments. Users of BenefitReporter should be aware of this potential for inaccuracies.

## Citation

```
@misc{li2024safetyanalystinterpretabletransparentsteerable,
  title={SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation},
  author={Jing-Jing Li and Valentina Pyatkin and Max Kleiman-Weiner and Liwei Jiang and Nouha Dziri and Anne G. E. Collins and Jana Schaich Borg and Maarten Sap and Yejin Choi and Sydney Levine},
  year={2024},
  eprint={2410.16665},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.16665},
}
```