sgugger Marissa committed on
Commit e3789f3 · 1 Parent(s): 07e9f73

Preliminary model card (#1)

- Preliminary model card (ac76c1c99e2a7af2cd37010dacbac602b16af324)
- Update README.md (c895c9e9622af5382daf130801d1a3a18af8142d)
- Update README.md (604fafd022553796c0a8cca154a0b6764d6a2db0)


Co-authored-by: Marissa Gerchick <[email protected]>

Files changed (1)
  1. README.md +194 -0
README.md ADDED
@@ -0,0 +1,194 @@
+ ---
+ language: en
+ license: bsd-3-clause
+ ---
+
+ # ctrl
+
+ # Table of Contents
+
+ 1. [Model Details](#model-details)
+ 2. [Uses](#uses)
+ 3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
+ 4. [Training](#training)
+ 5. [Evaluation](#evaluation)
+ 6. [Environmental Impact](#environmental-impact)
+ 7. [Technical Specifications](#technical-specifications)
+ 8. [Citation](#citation)
+ 9. [Model Card Authors](#model-card-authors)
+ 10. [How To Get Started With the Model](#how-to-get-started-with-the-model)
+
+
+ # Model Details
+
+ ## Model Description
+
+ The CTRL model was proposed in [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong, and Richard Socher. It is a causal (unidirectional) transformer pre-trained with a language modeling objective on roughly 140 GB of text, with the first token of each sequence reserved as a control code (such as Links, Books, or Wikipedia). The developers also released their own model card for CTRL, available [here](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf).
+
+ In their [model card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf), the developers write:
+
+ > The CTRL Language Model analyzed in this card generates text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior.
+
+ - **Developed by:** See [associated paper](https://arxiv.org/abs/1909.05858) from Salesforce Research
+ - **Model type:** Transformer-based language model
+ - **Language(s) (NLP):** Primarily English, some German, Spanish, French
+ - **License:** [BSD 3-Clause](https://github.com/salesforce/ctrl/blob/master/LICENSE.txt); also see [Code of Conduct](https://github.com/salesforce/ctrl)
+ - **Related Models:** More information needed
+ - **Parent Model:** More information needed
+ - **Resources for more information:**
+   - [Associated paper](https://arxiv.org/abs/1909.05858)
+   - [GitHub repo](https://github.com/salesforce/ctrl)
+   - [Developer Model Card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf)
+   - [Blog post](https://blog.salesforceairesearch.com/introducing-a-conditional-transformer-language-model-for-controllable-generation/)
+
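The Model Description says the first token of each sequence is reserved as a control code. A minimal sketch of what that means for prompt construction is shown here. The helper `build_prompt` and the set `KNOWN_CODES` are hypothetical names introduced for illustration; the set lists only codes mentioned in this card, not the full set the model supports.

```python
# Control codes mentioned in this card (assumption: a small illustrative
# subset, not the complete list supported by the released model).
KNOWN_CODES = {"Links", "Books", "Wikipedia", "News", "Opinion"}


def build_prompt(control_code: str, text: str) -> str:
    """Hypothetical helper: put the control code first, then the text."""
    if control_code not in KNOWN_CODES:
        raise ValueError(f"unknown control code: {control_code!r}")
    return f"{control_code} {text}".strip()


prompt = build_prompt("Opinion", "My dog is cute")
print(prompt)  # Opinion My dog is cute
```

Validating the code before building the prompt mirrors the check in the getting-started snippet at the end of the card, which asserts that the first token ID is one of the tokenizer's control codes.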
+ # Uses
+
+ ## Direct Use
+
+ CTRL is a language model and can be used for text generation.
+
+ ## Downstream Use
+
+ In their [model card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf), the developers write that the primary intended users are general audiences and NLP researchers, and that the primary intended uses are:
+
+ > 1. Generating artificial text in collaboration with a human, including but not limited to:
+ >    - Creative writing
+ >    - Automating repetitive writing tasks
+ >    - Formatting specific text types
+ >    - Creating contextualized marketing materials
+ > 2. Improvement of other NLP applications through fine-tuning (on another task or other data, e.g. fine-tuning CTRL to learn new kinds of language like product descriptions)
+ > 3. Enhancement in the field of natural language understanding to push towards a better understanding of artificial text generation, including how to detect it and work toward control, understanding, and potentially combating potentially negative consequences of such models.
+
+ ## Out-of-Scope Use
+
+ In their [model card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf), the developers write:
+
+ > - CTRL should not be used for generating artificial text without collaboration with a human.
+ > - It should not be used to make normative or prescriptive claims.
+ > - This software should not be used to promote or profit from:
+ >   - violence, hate, and division;
+ >   - environmental destruction;
+ >   - abuse of human rights; or
+ >   - the destruction of people's physical and mental health.
+
+ # Bias, Risks, and Limitations
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+
+ In their [model card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf), the developers write:
+
+ > We recognize the potential for misuse or abuse, including use by bad actors who could manipulate the system to act maliciously and generate text to influence decision-making in political, economic, and social settings. False attribution could also harm individuals, organizations, or other entities. To address these concerns, the model was evaluated internally as well as externally by third parties, including the Partnership on AI, prior to release.
+
+ > To mitigate potential misuse to the extent possible, we stripped out all detectable training data from undesirable sources. We then redteamed the model and found that negative utterances were often placed in contexts that made them identifiable as such. For example, when using the ‘News’ control code, hate speech could be embedded as part of an apology (e.g. “the politician apologized for saying [insert hateful statement]”), implying that this type of speech was negative. By pre-selecting the available control codes (omitting, for example, Instagram and Twitter from the available domains), we are able to limit the potential for misuse.
+
+ > In releasing our model, we hope to put it into the hands of researchers and prosocial actors so that they can work to control, understand, and potentially combat the negative consequences of such models. We hope that research into detecting fake news and model-generated content of all kinds will be pushed forward by CTRL. It is our belief that these models should become a common tool so researchers can design methods to guard against malicious use and so the public becomes familiar with their existence and patterns of behavior.
+
+ See the [associated paper](https://arxiv.org/pdf/1909.05858.pdf) for further discussion of the ethics of large language models.
+
+ ## Recommendations
+
+ In their [model card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf), the developers write:
+
+ > - A recommendation to monitor and detect use will be implemented through the development of a model that will identify CTRL-generated text.
+ > - A second recommendation to further screen the input into and output from the model will be implemented through the addition of a check in the CTRL interface to prohibit the insertion into the model of certain negative inputs, which will help control the output that can be generated.
+ > - The model is trained on a limited number of languages: primarily English and some German, Spanish, French. A recommendation for a future area of research is to train the model on more languages.
+
+ See the [CTRL-detector GitHub repo](https://github.com/salesforce/ctrl-detector) for more on the detector model.
+
+ # Training
+
+ ## Training Data
+
+ In their [model card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf), the developers write:
+
+ > This model is trained on 140 GB of text drawn from a variety of domains: Wikipedia (English, German, Spanish, and French), Project Gutenberg, submissions from 45 subreddits, OpenWebText, a large collection of news data, Amazon Reviews, Europarl and UN data from WMT (En-De, En-Es, En-Fr), question-answer pairs (no context documents) from ELI5, and the MRQA shared task, which includes Stanford Question Answering Dataset, NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions. See the paper for the full list of training data.
+
+ ## Training Procedure
+
+ ### Preprocessing
+
+ In the [associated paper](https://arxiv.org/pdf/1909.05858.pdf), the developers write:
+
+ > We learn BPE (Sennrich et al., 2015) codes and tokenize the data using fastBPE, but we use a large vocabulary of roughly 250K tokens. This includes the sub-word tokens necessary to mitigate problems with rare words, but it also reduces the average number of tokens required to generate long text by including most common words. We use English Wikipedia and a 5% split of our collected OpenWebText data for learning BPE codes. We also introduce an unknown token so that during preprocessing we can filter out sequences that contain more than 2 unknown tokens. This, along with the compressed storage for efficient training (TFRecords) (Abadi et al., 2016), reduces our training data to 140 GB from the total 180 GB collected.
+
+ See the paper for links, references, and further details.
+
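The filtering rule quoted above (drop any sequence with more than 2 unknown tokens) can be sketched in a few lines. This is an illustration only, not the developers' preprocessing code: the `"<unk>"` symbol, the function names, and the toy corpus are assumptions made for the example.

```python
UNK = "<unk>"  # assumed placeholder for the unknown token
MAX_UNK = 2    # from the quoted passage: more than 2 unknowns is filtered out


def keep_sequence(tokens, unk=UNK, max_unk=MAX_UNK):
    """Return True if the sequence has at most `max_unk` unknown tokens."""
    return sum(1 for t in tokens if t == unk) <= max_unk


def filter_sequences(sequences):
    """Keep only the sequences that pass the unknown-token check."""
    return [seq for seq in sequences if keep_sequence(seq)]


corpus = [
    ["the", "cat", "sat"],   # 0 unknown tokens: kept
    ["a", UNK, "b", UNK],    # 2 unknown tokens: kept
    [UNK, UNK, UNK, "x"],    # 3 unknown tokens: dropped
]
kept = filter_sequences(corpus)
print(len(kept))  # 2
```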
+ ### Training
+
+ In the [associated paper](https://arxiv.org/pdf/1909.05858.pdf), the developers write:
+
+ > CTRL has model dimension d = 1280, inner dimension f = 8192, 48 layers, and 16 heads per layer. Dropout with probability 0.1 follows the residual connections in each layer. Token embeddings were tied with the final output embedding layer (Inan et al., 2016; Press & Wolf, 2016).
+
+ See the paper for links, references, and further details.
+
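As a rough sanity check on the dimensions quoted above, the weight count can be estimated with simple arithmetic. The sketch below makes several simplifying assumptions that are not from the source: a vocabulary of exactly 250,000 tokens (the paper says "roughly 250K"), four d×d attention projections per layer, and no biases or layer-norm parameters. Because the token embeddings are tied with the output layer, the embedding matrix is counted once.

```python
D_MODEL = 1280        # model dimension d (from the quoted passage)
D_INNER = 8192        # inner feed-forward dimension f
N_LAYERS = 48
VOCAB_SIZE = 250_000  # assumption: "roughly 250K" rounded to a fixed value

# Tied input/output embeddings are counted once.
embedding = VOCAB_SIZE * D_MODEL

# Per layer: query, key, value, and output projections (4 * d * d),
# plus the two feed-forward matrices (d * f and f * d).
attention = 4 * D_MODEL * D_MODEL
feed_forward = 2 * D_MODEL * D_INNER
per_layer = attention + feed_forward

total = embedding + N_LAYERS * per_layer
print(f"{total / 1e9:.2f}B")  # 1.64B
```

The estimate lands close to the 1.63 billion parameters reported in the paper's abstract; the small gap is expected given the rounded vocabulary size and the omitted terms.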
+ # Evaluation
+
+ ## Testing Data, Factors & Metrics
+
+ In their [model card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf), the developers write that model performance measures are:
+
+ > Performance evaluated on qualitative judgments by humans as to whether the control codes lead to text generated in the desired domain
+
+ # Environmental Impact
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The details below are drawn from the [associated paper](https://arxiv.org/pdf/1909.05858.pdf).
+
+ - **Hardware Type:** TPU v3 Pod
+ - **Hours used:** Approximately 336 hours (2 weeks)
+ - **Cloud Provider:** GCP
+ - **Compute Region:** More information needed
+ - **Carbon Emitted:** More information needed
+
+ # Technical Specifications
+
+ In the [associated paper](https://arxiv.org/pdf/1909.05858.pdf), the developers write:
+
+ > CTRL was implemented in TensorFlow (Abadi et al., 2016) and trained with a global batch size of 1024 distributed across 256 cores of a Cloud TPU v3 Pod for 800k iterations. Training took approximately 2 weeks using Adagrad (Duchi et al., 2011) with a linear warmup from 0 to 0.05 over 25k steps. The norm of gradients were clipped to 0.25 as in (Merity et al., 2017). Learning rate decay was not necessary due to the monotonic nature of the Adagrad accumulator. We compared to the Adam optimizer (Kingma & Ba, 2014) while training smaller models, but we noticed comparable convergence rates and significant memory savings with Adagrad. We also experimented with explicit memory-saving optimizers including SM3 (Anil et al., 2019), Adafactor (Shazeer & Stern, 2018), and NovoGrad (Ginsburg et al., 2019) with mixed results.
+
+ See the paper for links, references, and further details.
+
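The schedule and clipping described in the quoted passage can be sketched in plain Python. This is not the developers' TensorFlow code. It assumes the base learning rate stays at 0.05 after warmup (the passage says no decay was applied) and that "norm of gradients" means the global L2 norm over all gradient values; both function names are hypothetical.

```python
import math

PEAK_LR = 0.05          # warmup target from the quoted passage
WARMUP_STEPS = 25_000   # "over 25k steps"
CLIP_NORM = 0.25        # gradient norm threshold
GLOBAL_BATCH = 1024
ITERATIONS = 800_000


def learning_rate(step):
    """Linear warmup from 0 to PEAK_LR, then constant (assumed)."""
    return PEAK_LR * (min(step, WARMUP_STEPS) / WARMUP_STEPS)


def clip_by_global_norm(grads, max_norm=CLIP_NORM):
    """Rescale a flat list of gradient values so its L2 norm is <= max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Sequences seen over the whole run, from batch size and iteration count.
total_sequences = GLOBAL_BATCH * ITERATIONS

print(learning_rate(12_500))            # halfway through warmup
print(clip_by_global_norm([3.0, 4.0]))  # norm 5.0 rescaled to norm 0.25
print(total_sequences)                  # 819200000
```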
+ # Citation
+
+ **BibTeX:**
+
+ ```bibtex
+ @article{keskarCTRL2019,
+   title={{CTRL - A Conditional Transformer Language Model for Controllable Generation}},
+   author={Keskar, Nitish Shirish and McCann, Bryan and Varshney, Lav and Xiong, Caiming and Socher, Richard},
+   journal={arXiv preprint arXiv:1909.05858},
+   year={2019}
+ }
+ ```
+
+ **APA:**
+ - Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
+
+ # Model Card Authors
+
+ This model card was written by the team at Hugging Face, referencing the [model card](https://github.com/salesforce/ctrl/blob/master/ModelCard.pdf) released by the developers.
+
+ # How to Get Started with the Model
+
+ Use the code below to get started with the model. See the [Hugging Face ctrl docs](https://huggingface.co/docs/transformers/model_doc/ctrl) for more information.
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ```python
+ >>> from transformers import CTRLTokenizer, CTRLModel
+ >>> import torch
+
+ >>> # Load the pretrained tokenizer and the base model (no language modeling head)
+ >>> tokenizer = CTRLTokenizer.from_pretrained("ctrl")
+ >>> model = CTRLModel.from_pretrained("ctrl")
+
+ >>> # CTRL was trained with control codes as the first token
+ >>> inputs = tokenizer("Opinion My dog is cute", return_tensors="pt")
+ >>> assert inputs["input_ids"][0, 0].item() in tokenizer.control_codes.values()
+
+ >>> outputs = model(**inputs)
+
+ >>> # Shape is [batch_size, sequence_length, hidden_size]
+ >>> last_hidden_states = outputs.last_hidden_state
+ >>> list(last_hidden_states.shape)
+ ```
+
+ </details>