Commit 93711b8 by NimaBoscarino (1 parent: aae10fc)
WIP: Add real model cards as test cases
- .gitignore +3 -0
- tests/cards/Helsinki-NLP___opus-mt-en-es.md +93 -0
- tests/cards/StanfordAIMI___stanford-deidentifier-base.md +23 -0
- tests/cards/albert-base-v2.md +263 -0
- tests/cards/bert-base-cased.md +220 -0
- tests/cards/bert-base-multilingual-cased.md +145 -0
- tests/cards/bert-base-uncased.md +241 -0
- tests/cards/cl-tohoku___bert-base-japanese-whole-word-masking.md +37 -0
- tests/cards/distilbert-base-cased-distilled-squad.md +179 -0
- tests/cards/distilbert-base-uncased-finetuned-sst-2-english.md +80 -0
- tests/cards/distilbert-base-uncased.md +208 -0
- tests/cards/distilroberta-base.md +175 -0
- tests/cards/emilyalsentzer___Bio_ClinicalBERT.md +37 -0
- tests/cards/facebook___bart-large-mnli.md +73 -0
- tests/cards/google___electra-base-discriminator.md +29 -0
- tests/cards/gpt2.md +158 -0
- tests/cards/jonatasgrosman___wav2vec2-large-xlsr-53-english.md +102 -0
- tests/cards/microsoft___layoutlmv3-base.md +29 -0
- tests/cards/openai___clip-vit-base-patch32.md +136 -0
- tests/cards/openai___clip-vit-large-patch14.md +136 -0
- tests/cards/philschmid___bart-large-cnn-samsum.md +62 -0
- tests/cards/prajjwal1___bert-tiny.md +46 -0
- tests/cards/roberta-base.md +224 -0
- tests/cards/roberta-large.md +225 -0
- tests/cards/runwayml___stable-diffusion-v1-5.md +188 -0
- tests/cards/sentence-transformers___all-MiniLM-L6-v2.md +142 -0
- tests/cards/t5-base.md +175 -0
- tests/cards/t5-small.md +175 -0
- tests/cards/xlm-roberta-base.md +99 -0
- tests/cards/xlm-roberta-large.md +99 -0
- tests/cards/yiyanghkust___finbert-tone.md +33 -0
- tests/conftest.py +61 -6
- tests/test_compliance_checks.py +7 -61
.gitignore
ADDED
@@ -0,0 +1,3 @@
flagged/
gradio_cached_examples/
.idea/
tests/cards/Helsinki-NLP___opus-mt-en-es.md
ADDED
@@ -0,0 +1,93 @@
### eng-spa

* source group: English
* target group: Spanish
* OPUS readme: [eng-spa](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-spa/README.md)

* model: transformer
* source language(s): eng
* target language(s): spa
* model: transformer
* pre-processing: normalization + SentencePiece (spm32k,spm32k)
* download original weights: [opus-2020-08-18.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-spa/opus-2020-08-18.zip)
* test set translations: [opus-2020-08-18.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-spa/opus-2020-08-18.test.txt)
* test set scores: [opus-2020-08-18.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-spa/opus-2020-08-18.eval.txt)

## Benchmarks

| testset                        | BLEU | chr-F |
|--------------------------------|------|-------|
| newssyscomb2009-engspa.eng.spa | 31.0 | 0.583 |
| news-test2008-engspa.eng.spa   | 29.7 | 0.564 |
| newstest2009-engspa.eng.spa    | 30.2 | 0.578 |
| newstest2010-engspa.eng.spa    | 36.9 | 0.620 |
| newstest2011-engspa.eng.spa    | 38.2 | 0.619 |
| newstest2012-engspa.eng.spa    | 39.0 | 0.625 |
| newstest2013-engspa.eng.spa    | 35.0 | 0.598 |
| Tatoeba-test.eng.spa           | 54.9 | 0.721 |

### System Info:
- hf_name: eng-spa
- source_languages: eng
- target_languages: spa
- opus_readme_url: https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-spa/README.md
- original_repo: Tatoeba-Challenge
- tags: ['translation']
- languages: ['en', 'es']
- src_constituents: {'eng'}
- tgt_constituents: {'spa'}
- src_multilingual: False
- tgt_multilingual: False
- prepro: normalization + SentencePiece (spm32k,spm32k)
- url_model: https://object.pouta.csc.fi/Tatoeba-MT-models/eng-spa/opus-2020-08-18.zip
- url_test_set: https://object.pouta.csc.fi/Tatoeba-MT-models/eng-spa/opus-2020-08-18.test.txt
- src_alpha3: eng
- tgt_alpha3: spa
- short_pair: en-es
- chrF2_score: 0.721
- bleu: 54.9
- brevity_penalty: 0.978
- ref_len: 77311.0
- src_name: English
- tgt_name: Spanish
- train_date: 2020-08-18 00:00:00
- src_alpha2: en
- tgt_alpha2: es
- prefer_old: False
- long_pair: eng-spa
- helsinki_git_sha: d2f0910c89026c34a44e331e785dec1e0faa7b82
- transformers_git_sha: f7af09b4524b784d67ae8526f0e2fcc6f5ed0de9
- port_machine: brutasse
- port_time: 2020-08-24-18:20
tests/cards/StanfordAIMI___stanford-deidentifier-base.md
ADDED
@@ -0,0 +1,23 @@
Stanford de-identifier was trained on a variety of radiology and biomedical documents with the goal of automatising the de-identification process while reaching satisfactory accuracy for use in production. Manuscript in-proceedings.

These model weights are the recommended ones among all available deidentifier weights.

Associated GitHub repo: https://github.com/MIDRC/Stanford_Penn_Deidentifier

## Citation

```bibtex
@article{10.1093/jamia/ocac219,
    author = {Chambon, Pierre J and Wu, Christopher and Steinkamp, Jackson M and Adleberg, Jason and Cook, Tessa S and Langlotz, Curtis P},
    title = "{Automated deidentification of radiology reports combining transformer and “hide in plain sight” rule-based methods}",
    journal = {Journal of the American Medical Informatics Association},
    year = {2022},
    month = {11},
    abstract = "{To develop an automated deidentification pipeline for radiology reports that detect protected health information (PHI) entities and replaces them with realistic surrogates “hiding in plain sight.” In this retrospective study, 999 chest X-ray and CT reports collected between November 2019 and November 2020 were annotated for PHI at the token level and combined with 3001 X-rays and 2193 medical notes previously labeled, forming a large multi-institutional and cross-domain dataset of 6193 documents. Two radiology test sets, from a known and a new institution, as well as i2b2 2006 and 2014 test sets, served as an evaluation set to estimate model performance and to compare it with previously released deidentification tools. Several PHI detection models were developed based on different training datasets, fine-tuning approaches and data augmentation techniques, and a synthetic PHI generation algorithm. These models were compared using metrics such as precision, recall and F1 score, as well as paired samples Wilcoxon tests. Our best PHI detection model achieves 97.9 F1 score on radiology reports from a known institution, 99.6 from a new institution, 99.5 on i2b2 2006, and 98.9 on i2b2 2014. On reports from a known institution, it achieves 99.1 recall of detecting the core of each PHI span. Our model outperforms all deidentifiers it was compared to on all test sets as well as human labelers on i2b2 2014 data. It enables accurate and automatic deidentification of radiology reports. A transformer-based deidentification pipeline can achieve state-of-the-art performance for deidentifying radiology reports and other medical documents.}",
    issn = {1527-974X},
    doi = {10.1093/jamia/ocac219},
    url = {https://doi.org/10.1093/jamia/ocac219},
    note = {ocac219},
    eprint = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocac219/47220191/ocac219.pdf},
}
```
tests/cards/albert-base-v2.md
ADDED
@@ -0,0 +1,263 @@
# ALBERT Base v2

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1909.11942) and first released in
[this repository](https://github.com/google-research/albert). This model, as all ALBERT models, is uncased: it does not make a difference
between english and English.

Disclaimer: The team releasing ALBERT did not write a model card for this model so this model card has been written by
the Hugging Face team.

## Model description

ALBERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs
  the entire masked sentence through the model and has to predict the masked words. This is different from traditional
  recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
  GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
  sentence.
- Sentence Ordering Prediction (SOP): ALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the ALBERT model as inputs.

ALBERT is particular in that it shares its layers across its Transformer. Therefore, all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.

This is the second version of the base model. Version 2 is different from version 1 due to different dropout rates, additional training data, and longer training. It has better results in nearly all downstream tasks.

This model has the following configuration:

- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters
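
To make the effect of layer sharing concrete, here is a rough parameter count in plain Python. It is only a sketch under stated assumptions: the feed-forward width (4x the hidden size) is assumed rather than taken from this card, the 30,000-token vocabulary comes from the Preprocessing section, and position embeddings, the pooler and any task head are ignored. The function names are illustrative and not part of any library.

```python
VOCAB_SIZE = 30_000         # vocabulary size stated in the Preprocessing section
EMBEDDING = 128             # embedding dimension from the configuration list
HIDDEN = 768                # hidden dimension from the configuration list
NUM_LAYERS = 12             # repeating layers from the configuration list
INTERMEDIATE = 4 * HIDDEN   # assumption: usual 4x feed-forward width


def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Weights and biases of one Transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)               # two LayerNorms, scale and shift each
    return attention + feed_forward + layer_norms


def encoder_params(num_layers: int, hidden: int, intermediate: int, shared: bool) -> int:
    """Encoder size: one layer's weights if shared, num_layers copies otherwise."""
    per_layer = encoder_layer_params(hidden, intermediate)
    return per_layer if shared else num_layers * per_layer


def embedding_params(vocab_size: int, embedding: int, hidden: int) -> int:
    """Factorized embedding: vocab x embedding table plus embedding -> hidden projection."""
    return vocab_size * embedding + (embedding * hidden + hidden)


embeddings = embedding_params(VOCAB_SIZE, EMBEDDING, HIDDEN)
total_shared = embeddings + encoder_params(NUM_LAYERS, HIDDEN, INTERMEDIATE, shared=True)
total_unshared = embeddings + encoder_params(NUM_LAYERS, HIDDEN, INTERMEDIATE, shared=False)

print(f"shared layers:   {total_shared / 1e6:.1f}M parameters")
print(f"unshared layers: {total_unshared / 1e6:.1f}M parameters")
```

Under these assumptions the shared configuration lands close to the 11M figure listed above, while an unshared 12-layer stack would need twelve times as many encoder weights. The forward pass still applies the layer 12 times, which is why compute cost does not shrink with the parameter count.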

## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence ordering prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=albert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at a model like GPT2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-base-v2')
>>> unmasker("Hello I'm a [MASK] model.")
[
   {
      "sequence":"[CLS] hello i'm a modeling model.[SEP]",
      "score":0.05816134437918663,
      "token":12807,
      "token_str":"▁modeling"
   },
   {
      "sequence":"[CLS] hello i'm a modelling model.[SEP]",
      "score":0.03748830780386925,
      "token":23089,
      "token_str":"▁modelling"
   },
   {
      "sequence":"[CLS] hello i'm a model model.[SEP]",
      "score":0.033725276589393616,
      "token":1061,
      "token_str":"▁model"
   },
   {
      "sequence":"[CLS] hello i'm a runway model.[SEP]",
      "score":0.017313428223133087,
      "token":8014,
      "token_str":"▁runway"
   },
   {
      "sequence":"[CLS] hello i'm a lingerie model.[SEP]",
      "score":0.014405295252799988,
      "token":29104,
      "token_str":"▁lingerie"
   }
]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = AlbertModel.from_pretrained("albert-base-v2")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = TFAlbertModel.from_pretrained("albert-base-v2")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-base-v2')
>>> unmasker("The man worked as a [MASK].")

[
   {
      "sequence":"[CLS] the man worked as a chauffeur.[SEP]",
      "score":0.029577180743217468,
      "token":28744,
      "token_str":"▁chauffeur"
   },
   {
      "sequence":"[CLS] the man worked as a janitor.[SEP]",
      "score":0.028865724802017212,
      "token":29477,
      "token_str":"▁janitor"
   },
   {
      "sequence":"[CLS] the man worked as a shoemaker.[SEP]",
      "score":0.02581118606030941,
      "token":29024,
      "token_str":"▁shoemaker"
   },
   {
      "sequence":"[CLS] the man worked as a blacksmith.[SEP]",
      "score":0.01849772222340107,
      "token":21238,
      "token_str":"▁blacksmith"
   },
   {
      "sequence":"[CLS] the man worked as a lawyer.[SEP]",
      "score":0.01820771023631096,
      "token":3672,
      "token_str":"▁lawyer"
   }
]

>>> unmasker("The woman worked as a [MASK].")

[
   {
      "sequence":"[CLS] the woman worked as a receptionist.[SEP]",
      "score":0.04604868218302727,
      "token":25331,
      "token_str":"▁receptionist"
   },
   {
      "sequence":"[CLS] the woman worked as a janitor.[SEP]",
      "score":0.028220869600772858,
      "token":29477,
      "token_str":"▁janitor"
   },
   {
      "sequence":"[CLS] the woman worked as a paramedic.[SEP]",
      "score":0.0261906236410141,
      "token":23386,
      "token_str":"▁paramedic"
   },
   {
      "sequence":"[CLS] the woman worked as a chauffeur.[SEP]",
      "score":0.024797942489385605,
      "token":28744,
      "token_str":"▁chauffeur"
   },
   {
      "sequence":"[CLS] the woman worked as a waitress.[SEP]",
      "score":0.024124596267938614,
      "token":13678,
      "token_str":"▁waitress"
   }
]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The ALBERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are lowercased and tokenized using SentencePiece and a vocabulary size of 30,000. The inputs of the model are
then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

### Training

The ALBERT procedure follows the BERT setup.

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
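
The masking rules listed above can be sketched in a few lines of plain Python. This is an illustration only, not the code used to train the model: `mask_tokens` is a hypothetical helper, each token is selected independently with probability 0.15 (an assumption; real pipelines may select an exact share of positions and skip special tokens), and the vocabulary is assumed to contain at least two distinct tokens.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN        # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            masked[i] = rng.choice([t for t in vocab if t != token])
        # remaining 10%: leave the token unchanged
    return masked, labels


tokens = "the man worked as a lawyer".split()
vocab = ["the", "man", "woman", "worked", "as", "a", "lawyer", "nurse"]
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked)
print(labels)
```

Keeping some selected tokens unchanged or swapping in a random token means the model cannot rely on `[MASK]` always marking the positions it has to predict.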

## Evaluation results

When fine-tuned on downstream tasks, the ALBERT models achieve the following results:

|                | Average  | SQuAD1.1 | SQuAD2.0 | MNLI     | SST-2    | RACE     |
|----------------|----------|----------|----------|----------|----------|----------|
|V2              |          |          |          |          |          |          |
|ALBERT-base     |82.3      |90.2/83.2 |82.1/79.3 |84.6      |92.9      |66.8      |
|ALBERT-large    |85.7      |91.8/85.2 |84.9/81.8 |86.5      |94.9      |75.2      |
|ALBERT-xlarge   |87.9      |92.9/86.4 |87.9/84.1 |87.9      |95.4      |80.7      |
|ALBERT-xxlarge  |90.9      |94.6/89.1 |89.8/86.9 |90.6      |96.8      |86.8      |
|V1              |          |          |          |          |          |          |
|ALBERT-base     |80.1      |89.3/82.3 |80.0/77.1 |81.6      |90.3      |64.0      |
|ALBERT-large    |82.4      |90.6/83.9 |82.3/79.4 |83.5      |91.7      |68.5      |
|ALBERT-xlarge   |85.5      |92.5/86.1 |86.1/83.1 |86.4      |92.4      |74.8      |
|ALBERT-xxlarge  |91.0      |94.8/89.3 |90.2/87.4 |90.8      |96.9      |86.5      |

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1909-11942,
  author    = {Zhenzhong Lan and
               Mingda Chen and
               Sebastian Goodman and
               Kevin Gimpel and
               Piyush Sharma and
               Radu Soricut},
  title     = {{ALBERT:} {A} Lite {BERT} for Self-supervised Learning of Language
               Representations},
  journal   = {CoRR},
  volume    = {abs/1909.11942},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.11942},
  archivePrefix = {arXiv},
  eprint    = {1909.11942},
  timestamp = {Fri, 27 Sep 2019 13:04:21 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-11942.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
tests/cards/bert-base-cased.md
ADDED
@@ -0,0 +1,220 @@
# BERT base model (cased)

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1810.04805) and first released in
[this repository](https://github.com/google-research/bert). This model is case-sensitive: it makes a difference between
english and English.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by
the Hugging Face team.

## Model description

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs
  the entire masked sentence through the model and has to predict the masked words. This is different from traditional
  recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
  GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
  sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
  they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
  predict if the two sentences were following each other or not.
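
As a rough illustration of the NSP objective, the following plain-Python sketch builds one training pair from a list of text segments. It is not the original data pipeline: `make_nsp_example` and `format_pair` are hypothetical helpers, the 0.5 probability is the value given in the Preprocessing section of this card, and the sketch assumes at least three segments with `index` pointing at a segment that has a successor.

```python
import random


def make_nsp_example(segments, index, p_next=0.5, rng=None):
    """Return (segment_a, segment_b, is_next) for the segment at `index`."""
    rng = rng or random.Random()
    segment_a = segments[index]
    if rng.random() < p_next:
        # positive example: the segment that really follows segment_a
        return segment_a, segments[index + 1], True
    # negative example: any other segment that is not the true continuation
    candidates = [j for j in range(len(segments)) if j not in (index, index + 1)]
    return segment_a, segments[rng.choice(candidates)], False


def format_pair(segment_a, segment_b):
    """Lay the pair out in the [CLS] A [SEP] B [SEP] input format."""
    return f"[CLS] {segment_a} [SEP] {segment_b} [SEP]"


corpus = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live mostly in the Southern Hemisphere.",
]
a, b, is_next = make_nsp_example(corpus, index=0, rng=random.Random(0))
print(format_pair(a, b), "->", "IsNext" if is_next else "NotNext")
```

The label returned alongside the pair is what the NSP head is trained to predict from the combined input.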

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at a model like GPT2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-cased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] Hello I'm a fashion model. [SEP]",
  'score': 0.09019174426794052,
  'token': 4633,
  'token_str': 'fashion'},
 {'sequence': "[CLS] Hello I'm a new model. [SEP]",
  'score': 0.06349995732307434,
  'token': 1207,
  'token_str': 'new'},
 {'sequence': "[CLS] Hello I'm a male model. [SEP]",
  'score': 0.06228214129805565,
  'token': 2581,
  'token_str': 'male'},
 {'sequence': "[CLS] Hello I'm a professional model. [SEP]",
  'score': 0.0441727414727211,
  'token': 1848,
  'token_str': 'professional'},
 {'sequence': "[CLS] Hello I'm a super model. [SEP]",
  'score': 0.03326151892542839,
  'token': 7688,
  'token_str': 'super'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained("bert-base-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained("bert-base-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-cased')
>>> unmasker("The man worked as a [MASK].")

[{'sequence': '[CLS] The man worked as a lawyer. [SEP]',
  'score': 0.04804691672325134,
  'token': 4545,
  'token_str': 'lawyer'},
 {'sequence': '[CLS] The man worked as a waiter. [SEP]',
  'score': 0.037494491785764694,
  'token': 17989,
  'token_str': 'waiter'},
 {'sequence': '[CLS] The man worked as a cop. [SEP]',
  'score': 0.035512614995241165,
  'token': 9947,
  'token_str': 'cop'},
 {'sequence': '[CLS] The man worked as a detective. [SEP]',
  'score': 0.031271643936634064,
  'token': 9140,
  'token_str': 'detective'},
 {'sequence': '[CLS] The man worked as a doctor. [SEP]',
  'score': 0.027423162013292313,
  'token': 3995,
  'token_str': 'doctor'}]

>>> unmasker("The woman worked as a [MASK].")

[{'sequence': '[CLS] The woman worked as a nurse. [SEP]',
  'score': 0.16927455365657806,
  'token': 7439,
  'token_str': 'nurse'},
 {'sequence': '[CLS] The woman worked as a waitress. [SEP]',
  'score': 0.1501094549894333,
  'token': 15098,
  'token_str': 'waitress'},
 {'sequence': '[CLS] The woman worked as a maid. [SEP]',
  'score': 0.05600163713097572,
  'token': 13487,
  'token_str': 'maid'},
 {'sequence': '[CLS] The woman worked as a housekeeper. [SEP]',
  'score': 0.04838843643665314,
  'token': 26458,
  'token_str': 'housekeeper'},
 {'sequence': '[CLS] The woman worked as a cook. [SEP]',
  'score': 0.029980547726154327,
  'token': 9834,
  'token_str': 'cook'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in
the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
"sentences" has a combined length of less than 512 tokens.

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.

### Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
learning rate warmup for 10,000 steps and linear decay of the learning rate after.
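
The learning-rate schedule described above can be written as a small function. This is a sketch under assumptions the card does not spell out: the warmup is taken to be linear from zero, and the decay is taken to reach zero exactly at the final step. `learning_rate` is an illustrative name, not a library function.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        # assumption: warmup rises linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # assumption: decay falls linearly from peak_lr to 0 at the last step
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```

The peak rate, warmup length and total step count default to the values stated in this section, so the function can be evaluated at any step to see where in the schedule training would be.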

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      | 84.6/83.4   | 71.2 | 90.5 | 93.5  | 52.1 | 85.8  | 88.9 | 66.4 | 79.6    |

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

<a href="https://huggingface.co/exbert/?model=bert-base-cased">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
tests/cards/bert-base-multilingual-cased.md
ADDED
@@ -0,0 +1,145 @@
|
1 |
+
# BERT multilingual base model (cased)
|
2 |
+
|
3 |
+
Pretrained model on the top 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective.
|
4 |
+
It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in
|
5 |
+
[this repository](https://github.com/google-research/bert). This model is case sensitive: it makes a difference
|
6 |
+
between english and English.
|
7 |
+
|
8 |
+
Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by
|
9 |
+
the Hugging Face team.
|
10 |
+
|
11 |
+
## Model description
|
12 |
+
|
13 |
+
BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means
|
14 |
+
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
|
15 |
+
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
|
16 |
+
was pretrained with two objectives:
|
17 |
+
|
18 |
+
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs
|
19 |
+
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
|
20 |
+
recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
|
21 |
+
GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
|
22 |
+
sentence.
|
23 |
+
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
|
24 |
+
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
|
25 |
+
predict if the two sentences were following each other or not.
|
26 |
+
|
27 |
+
This way, the model learns an inner representation of the languages in the training set that can then be used to
|
28 |
+
extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a
|
29 |
+
standard classifier using the features produced by the BERT model as inputs.
|
30 |
+
|
31 |
+
## Intended uses & limitations
|
32 |
+
|
33 |
+
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
|
34 |
+
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
|
35 |
+
fine-tuned versions on a task that interests you.
|
36 |
+
|
37 |
+
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
|
38 |
+
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
|
39 |
+
generation you should look at a model like GPT2.
|
40 |
+
|
41 |
+
### How to use
|
42 |
+
|
43 |
+
You can use this model directly with a pipeline for masked language modeling:
|
44 |
+
|
45 |
+
```python
|
46 |
+
>>> from transformers import pipeline
|
47 |
+
>>> unmasker = pipeline('fill-mask', model='bert-base-multilingual-cased')
|
48 |
+
>>> unmasker("Hello I'm a [MASK] model.")
|
49 |
+
|
50 |
+
[{'sequence': "[CLS] Hello I'm a model model. [SEP]",
|
51 |
+
'score': 0.10182085633277893,
|
52 |
+
'token': 13192,
|
53 |
+
'token_str': 'model'},
|
54 |
+
{'sequence': "[CLS] Hello I'm a world model. [SEP]",
|
55 |
+
'score': 0.052126359194517136,
|
56 |
+
'token': 11356,
|
57 |
+
'token_str': 'world'},
|
58 |
+
{'sequence': "[CLS] Hello I'm a data model. [SEP]",
|
59 |
+
'score': 0.048930276185274124,
|
60 |
+
'token': 11165,
|
61 |
+
'token_str': 'data'},
|
62 |
+
{'sequence': "[CLS] Hello I'm a flight model. [SEP]",
|
63 |
+
'score': 0.02036019042134285,
|
64 |
+
'token': 23578,
|
65 |
+
'token_str': 'flight'},
|
66 |
+
{'sequence': "[CLS] Hello I'm a business model. [SEP]",
|
67 |
+
'score': 0.020079681649804115,
|
68 |
+
'token': 14155,
|
69 |
+
'token_str': 'business'}]
|
70 |
+
```
|
71 |
+
|
72 |
+
Here is how to use this model to get the features of a given text in PyTorch:
|
73 |
+
|
74 |
+
```python
|
75 |
+
from transformers import BertTokenizer, BertModel
|
76 |
+
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
|
77 |
+
model = BertModel.from_pretrained("bert-base-multilingual-cased")
|
78 |
+
text = "Replace me by any text you'd like."
|
79 |
+
encoded_input = tokenizer(text, return_tensors='pt')
|
80 |
+
output = model(**encoded_input)
|
81 |
+
```
|
82 |
+
|
83 |
+
and in TensorFlow:
|
84 |
+
|
85 |
+
```python
|
86 |
+
from transformers import BertTokenizer, TFBertModel
|
87 |
+
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
|
88 |
+
model = TFBertModel.from_pretrained("bert-base-multilingual-cased")
|
89 |
+
text = "Replace me by any text you'd like."
|
90 |
+
encoded_input = tokenizer(text, return_tensors='tf')
|
91 |
+
output = model(encoded_input)
|
92 |
+
```
|
93 |
+
|
94 |
+
## Training data
|
95 |
+
|
96 |
+
The BERT model was pretrained on the 104 languages with the largest Wikipedias. You can find the complete list
|
97 |
+
[here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).
|
98 |
+
|
99 |
+
## Training procedure
|
100 |
+
|
101 |
+
### Preprocessing
|
102 |
+
|
103 |
+
The texts are tokenized using WordPiece (without lowercasing, since this model is cased) and a shared vocabulary size of 110,000. The languages with a
|
104 |
+
larger Wikipedia are under-sampled and the ones with lower resources are oversampled. For languages like Chinese,
|
105 |
+
Japanese Kanji and Korean Hanja that don't use spaces, whitespace is added around every character in the CJK Unicode block.
|
106 |
+
|
107 |
+
The inputs of the model are then of the form:
|
108 |
+
|
109 |
+
```
|
110 |
+
[CLS] Sentence A [SEP] Sentence B [SEP]
|
111 |
+
```
|
112 |
+
|
113 |
+
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in
|
114 |
+
the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
|
115 |
+
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
|
116 |
+
"sentences" has a combined length of less than 512 tokens.
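
The pair construction can be sketched as follows. This is an illustrative toy, not the original preprocessing code: `make_nsp_example` is a hypothetical helper, it expects at least three segments, it truncates the longer segment from the end, and it treats 512 as an inclusive upper bound, all of which are assumptions.

```python
import random


def make_nsp_example(segments, index, rng, max_len=512):
    """Build one `[CLS] A [SEP] B [SEP]` input and its next-sentence label.

    segments: list of token lists in document order (at least 3 of them).
    Returns (tokens, is_next).
    """
    seg_a = list(segments[index])
    if rng.random() < 0.5 and index + 1 < len(segments):
        # Probability 0.5: B is the segment that really follows A.
        seg_b, is_next = list(segments[index + 1]), True
    else:
        # Otherwise: B is a random segment that is neither A nor its successor.
        candidates = [i for i in range(len(segments)) if i not in (index, index + 1)]
        seg_b, is_next = list(segments[rng.choice(candidates)]), False
    # Trim the longer segment until the pair plus 3 special tokens fits.
    while len(seg_a) + len(seg_b) + 3 > max_len:
        longer = seg_a if len(seg_a) > len(seg_b) else seg_b
        longer.pop()
    return ["[CLS]"] + seg_a + ["[SEP]"] + seg_b + ["[SEP]"], is_next


toy_segments = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"], ["d1", "d2"]]
print(make_nsp_example(toy_segments, 0, random.Random(0)))
```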
|
117 |
+
|
118 |
+
The details of the masking procedure for each sentence are the following:
|
119 |
+
- 15% of the tokens are masked.
|
120 |
+
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
|
121 |
+
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
|
122 |
+
- In the 10% remaining cases, the masked tokens are left as is.
|
123 |
+
|
124 |
+
|
125 |
+
### BibTeX entry and citation info
|
126 |
+
|
127 |
+
```bibtex
|
128 |
+
@article{DBLP:journals/corr/abs-1810-04805,
|
129 |
+
author = {Jacob Devlin and
|
130 |
+
Ming{-}Wei Chang and
|
131 |
+
Kenton Lee and
|
132 |
+
Kristina Toutanova},
|
133 |
+
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
|
134 |
+
Understanding},
|
135 |
+
journal = {CoRR},
|
136 |
+
volume = {abs/1810.04805},
|
137 |
+
year = {2018},
|
138 |
+
url = {http://arxiv.org/abs/1810.04805},
|
139 |
+
archivePrefix = {arXiv},
|
140 |
+
eprint = {1810.04805},
|
141 |
+
timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
|
142 |
+
biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
|
143 |
+
bibsource = {dblp computer science bibliography, https://dblp.org}
|
144 |
+
}
|
145 |
+
```
|
tests/cards/bert-base-uncased.md
ADDED
@@ -0,0 +1,241 @@
|
1 |
+
# BERT base model (uncased)
|
2 |
+
|
3 |
+
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
|
4 |
+
[this paper](https://arxiv.org/abs/1810.04805) and first released in
|
5 |
+
[this repository](https://github.com/google-research/bert). This model is uncased: it does not make a difference
|
6 |
+
between english and English.
|
7 |
+
|
8 |
+
Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by
|
9 |
+
the Hugging Face team.
|
10 |
+
|
11 |
+
## Model description
|
12 |
+
|
13 |
+
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
|
14 |
+
was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of
|
15 |
+
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
|
16 |
+
was pretrained with two objectives:
|
17 |
+
|
18 |
+
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs
|
19 |
+
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
|
20 |
+
recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
|
21 |
+
GPT which internally masks the future tokens. It allows the model to learn a bidirectional representation of the
|
22 |
+
sentence.
|
23 |
+
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
|
24 |
+
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
|
25 |
+
predict if the two sentences were following each other or not.
|
26 |
+
|
27 |
+
This way, the model learns an inner representation of the English language that can then be used to extract features
|
28 |
+
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
|
29 |
+
classifier using the features produced by the BERT model as inputs.
|
30 |
+
|
31 |
+
## Model variations
|
32 |
+
|
33 |
+
BERT was originally released in base and large variations, for cased and uncased input text. The uncased models also strip out accent markers.
|
34 |
+
Chinese and multilingual uncased and cased versions followed shortly after.
|
35 |
+
Modified preprocessing with whole word masking replaced subpiece masking in a follow-up work, with the release of two models.
|
36 |
+
Another 24 smaller models were released afterward.
|
37 |
+
|
38 |
+
The detailed release history can be found on the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md) on GitHub.
|
39 |
+
|
40 |
+
| Model | #params | Language |
|
41 |
+
|------------------------|--------------------------------|-------|
|
42 |
+
| [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) | 110M | English |
|
43 |
+
| [`bert-large-uncased`](https://huggingface.co/bert-large-uncased) | 340M | English |
|
44 |
+
| [`bert-base-cased`](https://huggingface.co/bert-base-cased) | 110M | English |
|
45 |
+
| [`bert-large-cased`](https://huggingface.co/bert-large-cased) | 340M | English |
|
46 |
+
| [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 110M | Chinese |
|
47 |
+
| [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) | 110M | Multiple |
|
48 |
+
| [`bert-large-uncased-whole-word-masking`](https://huggingface.co/bert-large-uncased-whole-word-masking) | 340M | English |
|
49 |
+
| [`bert-large-cased-whole-word-masking`](https://huggingface.co/bert-large-cased-whole-word-masking) | 340M | English |
|
50 |
+
|
51 |
+
## Intended uses & limitations
|
52 |
+
|
53 |
+
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
|
54 |
+
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
|
55 |
+
fine-tuned versions on a task that interests you.
|
56 |
+
|
57 |
+
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
|
58 |
+
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
|
59 |
+
generation you should look at a model like GPT2.
|
60 |
+
|
61 |
+
### How to use
|
62 |
+
|
63 |
+
You can use this model directly with a pipeline for masked language modeling:
|
64 |
+
|
65 |
+
```python
|
66 |
+
>>> from transformers import pipeline
|
67 |
+
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
|
68 |
+
>>> unmasker("Hello I'm a [MASK] model.")
|
69 |
+
|
70 |
+
[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
|
71 |
+
'score': 0.1073106899857521,
|
72 |
+
'token': 4827,
|
73 |
+
'token_str': 'fashion'},
|
74 |
+
{'sequence': "[CLS] hello i'm a role model. [SEP]",
|
75 |
+
'score': 0.08774490654468536,
|
76 |
+
'token': 2535,
|
77 |
+
'token_str': 'role'},
|
78 |
+
{'sequence': "[CLS] hello i'm a new model. [SEP]",
|
79 |
+
'score': 0.05338378623127937,
|
80 |
+
'token': 2047,
|
81 |
+
'token_str': 'new'},
|
82 |
+
{'sequence': "[CLS] hello i'm a super model. [SEP]",
|
83 |
+
'score': 0.04667217284440994,
|
84 |
+
'token': 3565,
|
85 |
+
'token_str': 'super'},
|
86 |
+
{'sequence': "[CLS] hello i'm a fine model. [SEP]",
|
87 |
+
'score': 0.027095865458250046,
|
88 |
+
'token': 2986,
|
89 |
+
'token_str': 'fine'}]
|
90 |
+
```
|
91 |
+
|
92 |
+
Here is how to use this model to get the features of a given text in PyTorch:
|
93 |
+
|
94 |
+
```python
|
95 |
+
from transformers import BertTokenizer, BertModel
|
96 |
+
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
97 |
+
model = BertModel.from_pretrained("bert-base-uncased")
|
98 |
+
text = "Replace me by any text you'd like."
|
99 |
+
encoded_input = tokenizer(text, return_tensors='pt')
|
100 |
+
output = model(**encoded_input)
|
101 |
+
```
|
102 |
+
|
103 |
+
and in TensorFlow:
|
104 |
+
|
105 |
+
```python
|
106 |
+
from transformers import BertTokenizer, TFBertModel
|
107 |
+
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
108 |
+
model = TFBertModel.from_pretrained("bert-base-uncased")
|
109 |
+
text = "Replace me by any text you'd like."
|
110 |
+
encoded_input = tokenizer(text, return_tensors='tf')
|
111 |
+
output = model(encoded_input)
|
112 |
+
```
|
113 |
+
|
114 |
+
### Limitations and bias
|
115 |
+
|
116 |
+
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
|
117 |
+
predictions:
|
118 |
+
|
119 |
+
```python
|
120 |
+
>>> from transformers import pipeline
|
121 |
+
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
|
122 |
+
>>> unmasker("The man worked as a [MASK].")
|
123 |
+
|
124 |
+
[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
|
125 |
+
'score': 0.09747550636529922,
|
126 |
+
'token': 10533,
|
127 |
+
'token_str': 'carpenter'},
|
128 |
+
{'sequence': '[CLS] the man worked as a waiter. [SEP]',
|
129 |
+
'score': 0.0523831807076931,
|
130 |
+
'token': 15610,
|
131 |
+
'token_str': 'waiter'},
|
132 |
+
{'sequence': '[CLS] the man worked as a barber. [SEP]',
|
133 |
+
'score': 0.04962705448269844,
|
134 |
+
'token': 13362,
|
135 |
+
'token_str': 'barber'},
|
136 |
+
{'sequence': '[CLS] the man worked as a mechanic. [SEP]',
|
137 |
+
'score': 0.03788609802722931,
|
138 |
+
'token': 15893,
|
139 |
+
'token_str': 'mechanic'},
|
140 |
+
{'sequence': '[CLS] the man worked as a salesman. [SEP]',
|
141 |
+
'score': 0.037680890411138535,
|
142 |
+
'token': 18968,
|
143 |
+
'token_str': 'salesman'}]
|
144 |
+
|
145 |
+
>>> unmasker("The woman worked as a [MASK].")
|
146 |
+
|
147 |
+
[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
|
148 |
+
'score': 0.21981462836265564,
|
149 |
+
'token': 6821,
|
150 |
+
'token_str': 'nurse'},
|
151 |
+
{'sequence': '[CLS] the woman worked as a waitress. [SEP]',
|
152 |
+
'score': 0.1597415804862976,
|
153 |
+
'token': 13877,
|
154 |
+
'token_str': 'waitress'},
|
155 |
+
{'sequence': '[CLS] the woman worked as a maid. [SEP]',
|
156 |
+
'score': 0.1154729500412941,
|
157 |
+
'token': 10850,
|
158 |
+
'token_str': 'maid'},
|
159 |
+
{'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
|
160 |
+
'score': 0.037968918681144714,
|
161 |
+
'token': 19215,
|
162 |
+
'token_str': 'prostitute'},
|
163 |
+
{'sequence': '[CLS] the woman worked as a cook. [SEP]',
|
164 |
+
'score': 0.03042375110089779,
|
165 |
+
'token': 5660,
|
166 |
+
'token_str': 'cook'}]
|
167 |
+
```
|
168 |
+
|
169 |
+
This bias will also affect all fine-tuned versions of this model.
|
170 |
+
|
171 |
+
## Training data
|
172 |
+
|
173 |
+
The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
|
174 |
+
unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
|
175 |
+
headers).
|
176 |
+
|
177 |
+
## Training procedure
|
178 |
+
|
179 |
+
### Preprocessing
|
180 |
+
|
181 |
+
The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
|
182 |
+
then of the form:
|
183 |
+
|
184 |
+
```
|
185 |
+
[CLS] Sentence A [SEP] Sentence B [SEP]
|
186 |
+
```
|
187 |
+
|
188 |
+
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
|
189 |
+
the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
|
190 |
+
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
|
191 |
+
"sentences" has a combined length of less than 512 tokens.
|
192 |
+
|
193 |
+
The details of the masking procedure for each sentence are the following:
|
194 |
+
- 15% of the tokens are masked.
|
195 |
+
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
|
196 |
+
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
|
197 |
+
- In the 10% remaining cases, the masked tokens are left as is.
|
198 |
+
|
199 |
+
### Pretraining
|
200 |
+
|
201 |
+
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
|
202 |
+
of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
|
203 |
+
used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
|
204 |
+
learning rate warmup for 10,000 steps and linear decay of the learning rate after.
|
205 |
+
|
206 |
+
## Evaluation results
|
207 |
+
|
208 |
+
When fine-tuned on downstream tasks, this model achieves the following results:
|
209 |
+
|
210 |
+
GLUE test results:
|
211 |
+
|
212 |
+
| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|
213 |
+
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|
214 |
+
| | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
|
215 |
+
|
216 |
+
|
217 |
+
### BibTeX entry and citation info
|
218 |
+
|
219 |
+
```bibtex
|
220 |
+
@article{DBLP:journals/corr/abs-1810-04805,
|
221 |
+
author = {Jacob Devlin and
|
222 |
+
Ming{-}Wei Chang and
|
223 |
+
Kenton Lee and
|
224 |
+
Kristina Toutanova},
|
225 |
+
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
|
226 |
+
Understanding},
|
227 |
+
journal = {CoRR},
|
228 |
+
volume = {abs/1810.04805},
|
229 |
+
year = {2018},
|
230 |
+
url = {http://arxiv.org/abs/1810.04805},
|
231 |
+
archivePrefix = {arXiv},
|
232 |
+
eprint = {1810.04805},
|
233 |
+
timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
|
234 |
+
biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
|
235 |
+
bibsource = {dblp computer science bibliography, https://dblp.org}
|
236 |
+
}
|
237 |
+
```
|
238 |
+
|
239 |
+
<a href="https://huggingface.co/exbert/?model=bert-base-uncased">
|
240 |
+
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
|
241 |
+
</a>
|
tests/cards/cl-tohoku___bert-base-japanese-whole-word-masking.md
ADDED
@@ -0,0 +1,37 @@
|
1 |
+
# BERT base Japanese (IPA dictionary, whole word masking enabled)
|
2 |
+
|
3 |
+
This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.
|
4 |
+
|
5 |
+
This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.
|
6 |
+
Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective.
|
7 |
+
|
8 |
+
The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).
|
9 |
+
|
10 |
+
## Model architecture
|
11 |
+
|
12 |
+
The model architecture is the same as the original BERT base model: 12 layers, 768 dimensions of hidden states, and 12 attention heads.
|
13 |
+
|
14 |
+
## Training Data
|
15 |
+
|
16 |
+
The model is trained on Japanese Wikipedia as of September 1, 2019.
|
17 |
+
To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.
|
18 |
+
The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences.
|
19 |
+
|
20 |
+
## Tokenization
|
21 |
+
|
22 |
+
The texts are first tokenized by [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.
|
23 |
+
The vocabulary size is 32000.
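
For readers unfamiliar with the subword step, the sketch below shows greedy longest-match-first splitting, the usual way a WordPiece vocabulary is applied to an already-segmented word. It is a simplified illustration: the function name and the tiny Latin-script vocabulary are hypothetical stand-ins for the model's real 32000-entry vocabulary, and the MeCab word segmentation that precedes this step is not shown.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup.

    Pieces that do not start the word carry the "##" continuation prefix.
    If any position cannot be matched, the whole word maps to unk_token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
```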
|
24 |
+
|
25 |
+
## Training
|
26 |
+
|
27 |
+
The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
|
28 |
+
|
29 |
+
For the training of the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once.
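
A small sketch of the idea, under simplifying assumptions: word boundaries are recovered from the `##` continuation prefix rather than from MeCab output, every selected word is replaced by `[MASK]` outright, and the function name and toy tokens are hypothetical.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, rng=None, mask_token="[MASK]"):
    """Mask whole words: all subword pieces of a selected word are masked together."""
    rng = rng or random.Random()
    # Group token indices into words; a "##" piece continues the previous word.
    words = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        # One random draw per word, not per subword piece.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


pieces = ["un", "##aff", "##able", "play", "##ing", "now"]
print(whole_word_mask(pieces, mask_prob=0.5, rng=random.Random(0)))
```

Drawing once per word is the whole difference from per-token masking: a word is never left partly visible, so the model cannot fill in a masked piece just by reading its neighbouring pieces.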
|
30 |
+
|
31 |
+
## Licenses
|
32 |
+
|
33 |
+
The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
|
34 |
+
|
35 |
+
## Acknowledgments
|
36 |
+
|
37 |
+
For training models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
|
tests/cards/distilbert-base-cased-distilled-squad.md
ADDED
@@ -0,0 +1,179 @@
|
1 |
+
# DistilBERT base cased distilled SQuAD
|
2 |
+
|
3 |
+
## Table of Contents
|
4 |
+
- [Model Details](#model-details)
|
5 |
+
- [How To Get Started With the Model](#how-to-get-started-with-the-model)
|
6 |
+
- [Uses](#uses)
|
7 |
+
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
|
8 |
+
- [Training](#training)
|
9 |
+
- [Evaluation](#evaluation)
|
10 |
+
- [Environmental Impact](#environmental-impact)
|
11 |
+
- [Technical Specifications](#technical-specifications)
|
12 |
+
- [Citation Information](#citation-information)
|
13 |
+
- [Model Card Authors](#model-card-authors)
|
14 |
+
|
15 |
+
## Model Details
|
16 |
+
|
17 |
+
**Model Description:** The DistilBERT model was proposed in the blog post [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5), and the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than *bert-base-uncased* and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
|
18 |
+
|
19 |
+
This model is a fine-tuned checkpoint of [DistilBERT-base-cased](https://huggingface.co/distilbert-base-cased), fine-tuned using (a second step of) knowledge distillation on [SQuAD v1.1](https://huggingface.co/datasets/squad).
|
20 |
+
|
21 |
+
- **Developed by:** Hugging Face
|
22 |
+
- **Model Type:** Transformer-based language model
|
23 |
+
- **Language(s):** English
|
24 |
+
- **License:** Apache 2.0
|
25 |
+
- **Related Models:** [DistilBERT-base-cased](https://huggingface.co/distilbert-base-cased)
|
26 |
+
- **Resources for more information:**
|
27 |
+
- See [this repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) for more about Distil\* (a class of compressed models including this model)
|
28 |
+
- See [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) for more information about knowledge distillation and the training procedure
|
29 |
+
|
30 |
+
## How to Get Started with the Model
|
31 |
+
|
32 |
+
Use the code below to get started with the model.
|
33 |
+
|
34 |
+
```python
|
35 |
+
>>> from transformers import pipeline
|
36 |
+
>>> question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')
|
37 |
+
|
38 |
+
>>> context = r"""
|
39 |
+
... Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
|
40 |
+
... question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
|
41 |
+
... a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
|
42 |
+
... """
|
43 |
+
|
44 |
+
>>> result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
|
45 |
+
>>> print(
|
46 |
+
... f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
|
47 |
+
... )
|
48 |
+
|
49 |
+
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
|
50 |
+
```
|
51 |
+
|
52 |
+
Here is how to use this model in PyTorch:
|
53 |
+
|
54 |
+
```python
|
55 |
+
from transformers import DistilBertTokenizer, DistilBertModel
|
56 |
+
import torch
|
57 |
+
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
|
58 |
+
model = DistilBertModel.from_pretrained('distilbert-base-cased-distilled-squad')
|
59 |
+
|
60 |
+
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
|
61 |
+
|
62 |
+
inputs = tokenizer(question, text, return_tensors="pt")
|
63 |
+
with torch.no_grad():
|
64 |
+
    outputs = model(**inputs)
|
65 |
+
|
66 |
+
print(outputs)
|
67 |
+
```
|
68 |
+
|
69 |
+
And in TensorFlow:
|
70 |
+
|
71 |
+
```python
|
72 |
+
from transformers import DistilBertTokenizer, TFDistilBertForQuestionAnswering
|
73 |
+
import tensorflow as tf
|
74 |
+
|
75 |
+
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
|
76 |
+
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")
|
77 |
+
|
78 |
+
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
|
79 |
+
|
80 |
+
inputs = tokenizer(question, text, return_tensors="tf")
|
81 |
+
outputs = model(**inputs)
|
82 |
+
|
83 |
+
answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
|
84 |
+
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])
|
85 |
+
|
86 |
+
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
|
87 |
+
tokenizer.decode(predict_answer_tokens)
|
88 |
+
```
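
The TensorFlow snippet above takes the argmax of the start and end logits independently, which can in principle yield an end index that precedes the start. One common alternative is to score start/end pairs jointly. The sketch below is a plain-Python illustration with made-up logits; `best_span` and `max_answer_len` are hypothetical names, and this is not presented as what the `question-answering` pipeline does internally.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return ((start, end), score) for the best span with start <= end.

    The score of a span is start_logits[start] + end_logits[end]; spans
    longer than max_answer_len tokens are not considered.
    """
    best, best_score = None, float("-inf")
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score


# Made-up logits where independent argmax would pick start=1, end=0.
start_logits = [0.1, 5.0, 0.2, 0.3]
end_logits = [4.0, 0.1, 0.2, 3.0]
print(best_span(start_logits, end_logits))  # ((1, 3), 8.0)
```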
|
89 |
+
|
90 |
+
## Uses
|
91 |
+
|
92 |
+
This model can be used for question answering.
|
93 |
+
|
94 |
+
#### Misuse and Out-of-scope Use
|
95 |
+
|
96 |
+
The model should not be used to intentionally create hostile or alienating environments for people. In addition, the model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
|
97 |
+
|
98 |
+
## Risks, Limitations and Biases
|
99 |
+
|
100 |
+
**CONTENT WARNING: Readers should be aware that language generated by this model can be disturbing or offensive to some and can propagate historical and current stereotypes.**
|
101 |
+
|
102 |
+
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model can include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. For example:
|
103 |
+
|
104 |
+
|
105 |
+
```python
|
106 |
+
>>> from transformers import pipeline
|
107 |
+
>>> question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')
|
108 |
+
|
109 |
+
>>> context = r"""
|
110 |
+
... Alice is sitting on the bench. Bob is sitting next to her.
|
111 |
+
... """
|
112 |
+
|
113 |
+
>>> result = question_answerer(question="Who is the CEO?", context=context)
|
114 |
+
>>> print(
|
115 |
+
... f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
|
116 |
+
...)
|
117 |
+
|
118 |
+
Answer: 'Bob', score: 0.7527, start: 32, end: 35
|
119 |
+
```
|
120 |
+
|
121 |
+
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
|
122 |
+
|
123 |
+
## Training
|
124 |
+
|
125 |
+
#### Training Data
|
126 |
+
|
127 |
+
The [distilbert-base-cased model](https://huggingface.co/distilbert-base-cased) was trained using the same data as the [distilbert-base-uncased model](https://huggingface.co/distilbert-base-uncased). The [distilbert-base-uncased model card](https://huggingface.co/distilbert-base-uncased) describes its training data as:
|
128 |
+
|
129 |
+
> DistilBERT pretrained on the same data as BERT, which is [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers).
|
130 |
+
|
131 |
+
To learn more about the SQuAD v1.1 dataset, see the [SQuAD v1.1 data card](https://huggingface.co/datasets/squad).
|
132 |
+
|
133 |
+
#### Training Procedure
|
134 |
+
|
135 |
+
##### Preprocessing
|
136 |
+
|
137 |
+
See the [distilbert-base-cased model card](https://huggingface.co/distilbert-base-cased) for further details.
|
138 |
+
|
139 |
+
##### Pretraining
|
140 |
+
|
141 |
+
See the [distilbert-base-cased model card](https://huggingface.co/distilbert-base-cased) for further details.
|
142 |
+
|
143 |
+
## Evaluation
|
144 |
+
|
145 |
+
As discussed in the [model repository](https://github.com/huggingface/transformers/blob/main/examples/research_projects/distillation/README.md):
|
146 |
+
|
147 |
+
> This model reaches a F1 score of 87.1 on the [SQuAD v1.1] dev set (for comparison, BERT bert-base-cased version reaches a F1 score of 88.7).
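
For readers unfamiliar with the metric, the F1 score quoted above is based on token overlap between a predicted answer and a reference answer. The following is a minimal sketch, assuming plain lowercasing and whitespace tokenization. The official SQuAD evaluation also strips punctuation and articles and takes the best score over several reference answers, which this sketch omits. `token_f1` is an illustrative name, not a library function.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a shared token counts at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score would then be the average of `token_f1` over all question/answer pairs, expressed as a percentage.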
|
148 |
+
|
149 |
+
## Environmental Impact
|
150 |
+
|
151 |
+
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). We present the hardware type and hours used based on the [associated paper](https://arxiv.org/pdf/1910.01108.pdf). Note that these details are just for training DistilBERT, not including the fine-tuning with SQuAD.
|
152 |
+
|
153 |
+
- **Hardware Type:** 8 x 16 GB V100 GPUs
|
154 |
+
- **Hours used:** 90 hours
|
155 |
+
- **Cloud Provider:** Unknown
|
156 |
+
- **Compute Region:** Unknown
|
157 |
+
- **Carbon Emitted:** Unknown
|
158 |
+
|
159 |
+
## Technical Specifications
|
160 |
+
|
161 |
+
See the [associated paper](https://arxiv.org/abs/1910.01108) for details on the modeling architecture, objective, compute infrastructure, and training details.
|
162 |
+
|
163 |
+
## Citation Information
|
164 |
+
|
165 |
+
```bibtex
|
166 |
+
@inproceedings{sanh2019distilbert,
|
167 |
+
title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
|
168 |
+
author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
|
169 |
+
booktitle={NeurIPS EMC^2 Workshop},
|
170 |
+
year={2019}
|
171 |
+
}
|
172 |
+
```
|
173 |
+
|
174 |
+
APA:
|
175 |
+
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
|
176 |
+
|
177 |
+
## Model Card Authors
|
178 |
+
|
179 |
+
This model card was written by the Hugging Face team.
|
tests/cards/distilbert-base-uncased-finetuned-sst-2-english.md
ADDED
@@ -0,0 +1,80 @@
1 |
+
# DistilBERT base uncased finetuned SST-2
|
2 |
+
|
3 |
+
## Table of Contents
|
4 |
+
- [Model Details](#model-details)
|
5 |
+
- [How to Get Started With the Model](#how-to-get-started-with-the-model)
|
6 |
+
- [Uses](#uses)
|
7 |
+
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
|
8 |
+
- [Training](#training)
|
9 |
+
|
10 |
+
## Model Details
|
11 |
+
**Model Description:** This model is a fine-tuned checkpoint of [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased), fine-tuned on SST-2.
|
12 |
+
This model reaches an accuracy of 91.3 on the dev set (for comparison, the BERT bert-base-uncased version reaches an accuracy of 92.7).
|
13 |
+
- **Developed by:** Hugging Face
|
14 |
+
- **Model Type:** Text Classification
|
15 |
+
- **Language(s):** English
|
16 |
+
- **License:** Apache-2.0
|
17 |
+
- **Parent Model:** For more details about DistilBERT, we encourage users to check out [this model card](https://huggingface.co/distilbert-base-uncased).
|
18 |
+
- **Resources for more information:**
|
19 |
+
- [Model Documentation](https://huggingface.co/docs/transformers/main/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification)
|
20 |
+
|
21 |
+
## How to Get Started With the Model
|
22 |
+
|
23 |
+
Example of single-label classification:
|
24 |
+
|
25 |
+
```python
|
26 |
+
import torch
|
27 |
+
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
|
28 |
+
|
29 |
+
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
|
30 |
+
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
|
31 |
+
|
32 |
+
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
|
33 |
+
with torch.no_grad():
|
34 |
+
logits = model(**inputs).logits
|
35 |
+
|
36 |
+
predicted_class_id = logits.argmax().item()
|
37 |
+
model.config.id2label[predicted_class_id]
|
38 |
+
|
39 |
+
```
|
40 |
+
|
41 |
+
## Uses
|
42 |
+
|
43 |
+
#### Direct Use
|
44 |
+
|
45 |
+
This model can be used for sentiment classification, the task that SST-2 covers. The raw parent model, DistilBERT, is mostly intended to be fine-tuned on a downstream task such as this one. See the model hub to look for fine-tuned versions on a task that interests you.
|
46 |
+
|
47 |
+
#### Misuse and Out-of-scope Use
|
48 |
+
The model should not be used to intentionally create hostile or alienating environments for people. In addition, the model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope.
|
49 |
+
|
50 |
+
|
51 |
+
## Risks, Limitations and Biases
|
52 |
+
|
53 |
+
Based on a few experiments, we observed that this model could produce biased predictions that target underrepresented populations.
|
54 |
+
|
55 |
+
For instance, for sentences like `This film was filmed in COUNTRY`, this binary classification model will give radically different probabilities for the positive label depending on the country (0.89 if the country is France, but 0.08 if the country is Afghanistan) when nothing in the input indicates such a strong semantic shift. In this [colab](https://colab.research.google.com/gist/ageron/fb2f64fb145b4bc7c49efc97e5f114d3/biasmap.ipynb), [Aurélien Géron](https://twitter.com/aureliengeron) made an interesting map plotting these probabilities for each country.
|
56 |
+
|
57 |
+
<img src="https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/map.jpeg" alt="Map of positive probabilities per country." width="500"/>
|
58 |
+
|
59 |
+
We strongly advise users to thoroughly probe these aspects on their use-cases in order to evaluate the risks of this model. We recommend looking at the following bias evaluation datasets as a place to start: [WinoBias](https://huggingface.co/datasets/wino_bias), [WinoGender](https://huggingface.co/datasets/super_glue), [Stereoset](https://huggingface.co/datasets/stereoset).
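
One simple way to begin such probing is to score the same template with different fill-in values and compare the results. The following is a minimal sketch, assuming the classifier is any callable that maps a sentence to a positive-label probability. `probe_template`, `score_spread`, and `toy_classifier` are illustrative names, and the toy classifier is a stand-in, not the real model.

```python
from typing import Callable, Dict, Iterable


def probe_template(
    classify: Callable[[str], float],
    template: str,
    values: Iterable[str],
) -> Dict[str, float]:
    """Score one template sentence for each fill-in value."""
    return {value: classify(template.format(value=value)) for value in values}


def score_spread(scores: Dict[str, float]) -> float:
    """Gap between the highest and lowest score; a large gap hints at bias."""
    return max(scores.values()) - min(scores.values())


def toy_classifier(sentence: str) -> float:
    # Stand-in that ignores the country entirely, so the spread is zero.
    return 0.5


scores = probe_template(
    toy_classifier,
    "This film was filmed in {value}.",
    ["France", "Afghanistan"],
)
print(scores, score_spread(scores))
```

With the real model, `classify` could wrap a text-classification pipeline and return the probability of the positive label. For a template like this one, a spread close to zero is what one would hope to see.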
|
60 |
+
|
61 |
+
|
62 |
+
|
63 |
+
## Training
|
64 |
+
|
65 |
+
|
66 |
+
#### Training Data
|
67 |
+
|
68 |
+
|
69 |
+
The authors fine-tuned the model on the Stanford Sentiment Treebank ([sst2](https://huggingface.co/datasets/sst2)) corpus.
|
70 |
+
|
71 |
+
#### Training Procedure
|
72 |
+
|
73 |
+
##### Fine-tuning hyper-parameters
|
74 |
+
|
75 |
+
|
76 |
+
- learning_rate = 1e-5
|
77 |
+
- batch_size = 32
|
78 |
+
- warmup = 600
|
79 |
+
- max_seq_length = 128
|
80 |
+
- num_train_epochs = 3.0
|
tests/cards/distilbert-base-uncased.md
ADDED
@@ -0,0 +1,208 @@
1 |
+
# DistilBERT base model (uncased)
|
2 |
+
|
3 |
+
This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-uncased). It was
|
4 |
+
introduced in [this paper](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found
|
5 |
+
[here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation). This model is uncased: it does
|
6 |
+
not make a difference between english and English.
|
7 |
+
|
8 |
+
## Model description
|
9 |
+
|
10 |
+
DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a
|
11 |
+
self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only,
|
12 |
+
with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic
|
13 |
+
process to generate inputs and labels from those texts using the BERT base model. More precisely, it was pretrained
|
14 |
+
with three objectives:
|
15 |
+
|
16 |
+
- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
|
17 |
+
- Masked language modeling (MLM): this is part of the original training loss of the BERT base model. When taking a
|
18 |
+
sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the
|
19 |
+
model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that
|
20 |
+
usually see the words one after the other, or from autoregressive models like GPT which internally mask the future
|
21 |
+
tokens. It allows the model to learn a bidirectional representation of the sentence.
|
22 |
+
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base
|
23 |
+
model.
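
The three objectives above can be sketched as a single weighted loss for one masked position. This is a minimal pure-Python illustration, not the training code: the temperature of 2.0 and the equal weights are placeholder assumptions, the distillation term is written as a cross-entropy against the teacher's softened probabilities (one common formulation), and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0) -> float:
    """Cross-entropy of the student against the teacher's softened probabilities."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def mlm_loss(student_logits, target_index: int) -> float:
    """Negative log-likelihood of the true token at a masked position."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_loss(student_hidden, teacher_hidden) -> float:
    """One minus the cosine similarity of student and teacher hidden states."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def combined_loss(
    student_logits,
    teacher_logits,
    target_index: int,
    student_hidden,
    teacher_hidden,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed, not from the card
    temperature: float = 2.0,  # assumed, not from the card
) -> float:
    """Weighted sum of the distillation, MLM, and cosine terms."""
    w_distil, w_mlm, w_cos = weights
    return (
        w_distil * distillation_loss(student_logits, teacher_logits, temperature)
        + w_mlm * mlm_loss(student_logits, target_index)
        + w_cos * cosine_loss(student_hidden, teacher_hidden)
    )
```

The distillation term is smallest when the student reproduces the teacher's distribution, and the cosine term is zero when the two hidden-state vectors point in the same direction.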
|
24 |
+
|
25 |
+
This way, the model learns the same inner representation of the English language as its teacher model, while being
|
26 |
+
faster for inference or downstream tasks.
|
27 |
+
|
28 |
+
## Intended uses & limitations
|
29 |
+
|
30 |
+
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
|
31 |
+
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=distilbert) to look for
|
32 |
+
fine-tuned versions on a task that interests you.
|
33 |
+
|
34 |
+
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
|
35 |
+
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
|
36 |
+
generation you should look at models like GPT2.
|
37 |
+
|
38 |
+
### How to use
|
39 |
+
|
40 |
+
You can use this model directly with a pipeline for masked language modeling:
|
41 |
+
|
42 |
+
```python
|
43 |
+
>>> from transformers import pipeline
|
44 |
+
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
|
45 |
+
>>> unmasker("Hello I'm a [MASK] model.")
|
46 |
+
|
47 |
+
[{'sequence': "[CLS] hello i'm a role model. [SEP]",
|
48 |
+
'score': 0.05292855575680733,
|
49 |
+
'token': 2535,
|
50 |
+
'token_str': 'role'},
|
51 |
+
{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
|
52 |
+
'score': 0.03968575969338417,
|
53 |
+
'token': 4827,
|
54 |
+
'token_str': 'fashion'},
|
55 |
+
{'sequence': "[CLS] hello i'm a business model. [SEP]",
|
56 |
+
'score': 0.034743521362543106,
|
57 |
+
'token': 2449,
|
58 |
+
'token_str': 'business'},
|
59 |
+
{'sequence': "[CLS] hello i'm a model model. [SEP]",
|
60 |
+
'score': 0.03462274372577667,
|
61 |
+
'token': 2944,
|
62 |
+
'token_str': 'model'},
|
63 |
+
{'sequence': "[CLS] hello i'm a modeling model. [SEP]",
|
64 |
+
'score': 0.018145186826586723,
|
65 |
+
'token': 11643,
|
66 |
+
'token_str': 'modeling'}]
|
67 |
+
```
|
68 |
+
|
69 |
+
Here is how to use this model to get the features of a given text in PyTorch:
|
70 |
+
|
71 |
+
```python
|
72 |
+
from transformers import DistilBertTokenizer, DistilBertModel
|
73 |
+
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
|
74 |
+
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
|
75 |
+
text = "Replace me by any text you'd like."
|
76 |
+
encoded_input = tokenizer(text, return_tensors='pt')
|
77 |
+
output = model(**encoded_input)
|
78 |
+
```
|
79 |
+
|
80 |
+
and in TensorFlow:
|
81 |
+
|
82 |
+
```python
|
83 |
+
from transformers import DistilBertTokenizer, TFDistilBertModel
|
84 |
+
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
|
85 |
+
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
|
86 |
+
text = "Replace me by any text you'd like."
|
87 |
+
encoded_input = tokenizer(text, return_tensors='tf')
|
88 |
+
output = model(encoded_input)
|
89 |
+
```
|
90 |
+
|
91 |
+
### Limitations and bias
|
92 |
+
|
93 |
+
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
|
94 |
+
predictions. It also inherits some of
|
95 |
+
[the bias of its teacher model](https://huggingface.co/bert-base-uncased#limitations-and-bias).
|
96 |
+
|
97 |
+
```python
|
98 |
+
>>> from transformers import pipeline
|
99 |
+
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
|
100 |
+
>>> unmasker("The White man worked as a [MASK].")
|
101 |
+
|
102 |
+
[{'sequence': '[CLS] the white man worked as a blacksmith. [SEP]',
|
103 |
+
'score': 0.1235365942120552,
|
104 |
+
'token': 20987,
|
105 |
+
'token_str': 'blacksmith'},
|
106 |
+
{'sequence': '[CLS] the white man worked as a carpenter. [SEP]',
|
107 |
+
'score': 0.10142576694488525,
|
108 |
+
'token': 10533,
|
109 |
+
'token_str': 'carpenter'},
|
110 |
+
{'sequence': '[CLS] the white man worked as a farmer. [SEP]',
|
111 |
+
'score': 0.04985016956925392,
|
112 |
+
'token': 7500,
|
113 |
+
'token_str': 'farmer'},
|
114 |
+
{'sequence': '[CLS] the white man worked as a miner. [SEP]',
|
115 |
+
'score': 0.03932540491223335,
|
116 |
+
'token': 18594,
|
117 |
+
'token_str': 'miner'},
|
118 |
+
{'sequence': '[CLS] the white man worked as a butcher. [SEP]',
|
119 |
+
'score': 0.03351764753460884,
|
120 |
+
'token': 14998,
|
121 |
+
'token_str': 'butcher'}]
|
122 |
+
|
123 |
+
>>> unmasker("The Black woman worked as a [MASK].")
|
124 |
+
|
125 |
+
[{'sequence': '[CLS] the black woman worked as a waitress. [SEP]',
|
126 |
+
'score': 0.13283951580524445,
|
127 |
+
'token': 13877,
|
128 |
+
'token_str': 'waitress'},
|
129 |
+
{'sequence': '[CLS] the black woman worked as a nurse. [SEP]',
|
130 |
+
'score': 0.12586183845996857,
|
131 |
+
'token': 6821,
|
132 |
+
'token_str': 'nurse'},
|
133 |
+
{'sequence': '[CLS] the black woman worked as a maid. [SEP]',
|
134 |
+
'score': 0.11708822101354599,
|
135 |
+
'token': 10850,
|
136 |
+
'token_str': 'maid'},
|
137 |
+
{'sequence': '[CLS] the black woman worked as a prostitute. [SEP]',
|
138 |
+
'score': 0.11499975621700287,
|
139 |
+
'token': 19215,
|
140 |
+
'token_str': 'prostitute'},
|
141 |
+
{'sequence': '[CLS] the black woman worked as a housekeeper. [SEP]',
|
142 |
+
'score': 0.04722772538661957,
|
143 |
+
'token': 22583,
|
144 |
+
'token_str': 'housekeeper'}]
|
145 |
+
```
|
146 |
+
|
147 |
+
This bias will also affect all fine-tuned versions of this model.
|
148 |
+
|
149 |
+
## Training data
|
150 |
+
|
151 |
+
DistilBERT was pretrained on the same data as BERT, which is [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset
|
152 |
+
consisting of 11,038 unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
|
153 |
+
(excluding lists, tables and headers).
|
154 |
+
|
155 |
+
## Training procedure
|
156 |
+
|
157 |
+
### Preprocessing
|
158 |
+
|
159 |
+
The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
|
160 |
+
then of the form:
|
161 |
+
|
162 |
+
```
|
163 |
+
[CLS] Sentence A [SEP] Sentence B [SEP]
|
164 |
+
```
|
165 |
+
|
166 |
+
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in
|
167 |
+
the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
|
168 |
+
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
|
169 |
+
"sentences" has a combined length of less than 512 tokens.
|
170 |
+
|
171 |
+
The details of the masking procedure for each sentence are the following:
|
172 |
+
- 15% of the tokens are masked.
|
173 |
+
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
|
174 |
+
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
|
175 |
+
- In the 10% remaining cases, the masked tokens are left as is.
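
The masking rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual preprocessing code: each position is selected independently with probability 0.15, special tokens such as `[CLS]` and `[SEP]` are not treated separately, and `mask_tokens` is a hypothetical helper name.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # roughly 85% of positions are left untouched
        labels[i] = token  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:
            # Random replacement, drawn from tokens other than the original.
            candidates = [v for v in vocab if v != token]
            if candidates:
                corrupted[i] = rng.choice(candidates)
        # Otherwise (remaining 10%) the original token is kept as is.
    return corrupted, labels


rng = random.Random(0)
corrupted, labels = mask_tokens(
    ["the", "cat", "sat", "on", "the", "mat"], ["the", "cat", "sat", "on", "mat"], rng
)
print(corrupted, labels)
```

Keeping a label for the unchanged and randomly replaced positions is what lets the model learn from tokens it can actually see, not only from `[MASK]` placeholders.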
|
176 |
+
|
177 |
+
### Pretraining
|
178 |
+
|
179 |
+
The model was trained on eight 16 GB V100 GPUs for 90 hours. See the
|
180 |
+
[training code](https://github.com/huggingface/transformers/tree/master/examples/distillation) for all hyperparameter
|
181 |
+
details.
|
182 |
+
|
183 |
+
## Evaluation results
|
184 |
+
|
185 |
+
When fine-tuned on downstream tasks, this model achieves the following results:
|
186 |
+
|
187 |
+
GLUE test results:
|
188 |
+
|
189 |
+
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|
190 |
+
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|
191 |
+
| | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |
|
192 |
+
|
193 |
+
|
194 |
+
### BibTeX entry and citation info
|
195 |
+
|
196 |
+
```bibtex
|
197 |
+
@article{Sanh2019DistilBERTAD,
|
198 |
+
title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
|
199 |
+
author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
|
200 |
+
journal={ArXiv},
|
201 |
+
year={2019},
|
202 |
+
volume={abs/1910.01108}
|
203 |
+
}
|
204 |
+
```
|
205 |
+
|
206 |
+
<a href="https://huggingface.co/exbert/?model=distilbert-base-uncased">
|
207 |
+
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
|
208 |
+
</a>
|
tests/cards/distilroberta-base.md
ADDED
@@ -0,0 +1,175 @@
1 |
+
# Model Card for DistilRoBERTa base
|
2 |
+
|
3 |
+
# Table of Contents
|
4 |
+
|
5 |
+
1. [Model Details](#model-details)
|
6 |
+
2. [Uses](#uses)
|
7 |
+
3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
|
8 |
+
4. [Training Details](#training-details)
|
9 |
+
5. [Evaluation](#evaluation)
|
10 |
+
6. [Environmental Impact](#environmental-impact)
|
11 |
+
7. [Citation](#citation)
|
12 |
+
8. [How To Get Started With the Model](#how-to-get-started-with-the-model)
|
13 |
+
|
14 |
+
# Model Details
|
15 |
+
|
16 |
+
## Model Description
|
17 |
+
|
18 |
+
This model is a distilled version of the [RoBERTa-base model](https://huggingface.co/roberta-base). It follows the same training procedure as [DistilBERT](https://huggingface.co/distilbert-base-uncased).
|
19 |
+
The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/distillation).
|
20 |
+
This model is case-sensitive: it makes a difference between english and English.
|
21 |
+
|
22 |
+
The model has 6 layers, a hidden dimension of 768, and 12 heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base).
|
23 |
+
On average, DistilRoBERTa is twice as fast as RoBERTa-base.
|
24 |
+
|
25 |
+
We encourage users of this model card to check out the [RoBERTa-base model card](https://huggingface.co/roberta-base) to learn more about usage, limitations and potential biases.
|
26 |
+
|
27 |
+
- **Developed by:** Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf (Hugging Face)
|
28 |
+
- **Model type:** Transformer-based language model
|
29 |
+
- **Language(s) (NLP):** English
|
30 |
+
- **License:** Apache 2.0
|
31 |
+
- **Related Models:** [RoBERTa-base model card](https://huggingface.co/roberta-base)
|
32 |
+
- **Resources for more information:**
|
33 |
+
- [GitHub Repository](https://github.com/huggingface/transformers/blob/main/examples/research_projects/distillation/README.md)
|
34 |
+
- [Associated Paper](https://arxiv.org/abs/1910.01108)
|
35 |
+
|
36 |
+
# Uses
|
37 |
+
|
38 |
+
## Direct Use and Downstream Use
|
39 |
+
|
40 |
+
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that interests you.
|
41 |
+
|
42 |
+
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
|
43 |
+
|
44 |
+
## Out of Scope Use
|
45 |
+
|
46 |
+
The model should not be used to intentionally create hostile or alienating environments for people. The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope.
|
47 |
+
|
48 |
+
# Bias, Risks, and Limitations
|
49 |
+
|
50 |
+
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. For example:
|
51 |
+
|
52 |
+
```python
|
53 |
+
>>> from transformers import pipeline
|
54 |
+
>>> unmasker = pipeline('fill-mask', model='distilroberta-base')
|
55 |
+
>>> unmasker("The man worked as a <mask>.")
|
56 |
+
[{'score': 0.1237526461482048,
|
57 |
+
'sequence': 'The man worked as a waiter.',
|
58 |
+
'token': 38233,
|
59 |
+
'token_str': ' waiter'},
|
60 |
+
{'score': 0.08968018740415573,
|
61 |
+
'sequence': 'The man worked as a waitress.',
|
62 |
+
'token': 35698,
|
63 |
+
'token_str': ' waitress'},
|
64 |
+
{'score': 0.08387645334005356,
|
65 |
+
'sequence': 'The man worked as a bartender.',
|
66 |
+
'token': 33080,
|
67 |
+
'token_str': ' bartender'},
|
68 |
+
{'score': 0.061059024184942245,
|
69 |
+
'sequence': 'The man worked as a mechanic.',
|
70 |
+
'token': 25682,
|
71 |
+
'token_str': ' mechanic'},
|
72 |
+
{'score': 0.03804653510451317,
|
73 |
+
'sequence': 'The man worked as a courier.',
|
74 |
+
'token': 37171,
|
75 |
+
'token_str': ' courier'}]
|
76 |
+
|
77 |
+
>>> unmasker("The woman worked as a <mask>.")
|
78 |
+
[{'score': 0.23149248957633972,
|
79 |
+
'sequence': 'The woman worked as a waitress.',
|
80 |
+
'token': 35698,
|
81 |
+
'token_str': ' waitress'},
|
82 |
+
{'score': 0.07563332468271255,
|
83 |
+
'sequence': 'The woman worked as a waiter.',
|
84 |
+
'token': 38233,
|
85 |
+
'token_str': ' waiter'},
|
86 |
+
{'score': 0.06983394920825958,
|
87 |
+
'sequence': 'The woman worked as a bartender.',
|
88 |
+
'token': 33080,
|
89 |
+
'token_str': ' bartender'},
|
90 |
+
{'score': 0.05411609262228012,
|
91 |
+
'sequence': 'The woman worked as a nurse.',
|
92 |
+
'token': 9008,
|
93 |
+
'token_str': ' nurse'},
|
94 |
+
{'score': 0.04995106905698776,
|
95 |
+
'sequence': 'The woman worked as a maid.',
|
96 |
+
'token': 29754,
|
97 |
+
'token_str': ' maid'}]
|
98 |
+
```
|
99 |
+
|
100 |
+
## Recommendations
|
101 |
+
|
102 |
+
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
|
103 |
+
|
104 |
+
# Training Details
|
105 |
+
|
106 |
+
DistilRoBERTa was pre-trained on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (roughly 4 times less training data than the teacher, RoBERTa, was trained on). See the [roberta-base model card](https://huggingface.co/roberta-base/blob/main/README.md) for further details on training.
|
107 |
+
|
108 |
+
# Evaluation
|
109 |
+
|
110 |
+
When fine-tuned on downstream tasks, this model achieves the following results (see [GitHub Repo](https://github.com/huggingface/transformers/blob/main/examples/research_projects/distillation/README.md)):
|
111 |
+
|
112 |
+
GLUE test results:
|
113 |
+
|
114 |
+
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|
115 |
+
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|
116 |
+
| | 84.0 | 89.4 | 90.8 | 92.5 | 59.3 | 88.3 | 86.6 | 67.9 |
|
117 |
+
|
118 |
+
# Environmental Impact
|
119 |
+
|
120 |
+
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
|
121 |
+
|
122 |
+
- **Hardware Type:** More information needed
|
123 |
+
- **Hours used:** More information needed
|
124 |
+
- **Cloud Provider:** More information needed
|
125 |
+
- **Compute Region:** More information needed
|
126 |
+
- **Carbon Emitted:** More information needed
|
127 |
+
|
128 |
+
# Citation
|
129 |
+
|
130 |
+
```bibtex
|
131 |
+
@article{Sanh2019DistilBERTAD,
|
132 |
+
title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
|
133 |
+
author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
|
134 |
+
journal={ArXiv},
|
135 |
+
year={2019},
|
136 |
+
volume={abs/1910.01108}
|
137 |
+
}
|
138 |
+
```
|
139 |
+
|
140 |
+
APA
|
141 |
+
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
|
142 |
+
|
143 |
+
# How to Get Started With the Model
|
144 |
+
|
145 |
+
You can use the model directly with a pipeline for masked language modeling:
|
146 |
+
|
147 |
+
```python
|
148 |
+
>>> from transformers import pipeline
|
149 |
+
>>> unmasker = pipeline('fill-mask', model='distilroberta-base')
|
150 |
+
>>> unmasker("Hello I'm a <mask> model.")
|
151 |
+
[{'score': 0.04673689603805542,
|
152 |
+
'sequence': "Hello I'm a business model.",
|
153 |
+
'token': 265,
|
154 |
+
'token_str': ' business'},
|
155 |
+
{'score': 0.03846118599176407,
|
156 |
+
'sequence': "Hello I'm a freelance model.",
|
157 |
+
'token': 18150,
|
158 |
+
'token_str': ' freelance'},
|
159 |
+
{'score': 0.03308931365609169,
|
160 |
+
'sequence': "Hello I'm a fashion model.",
|
161 |
+
'token': 2734,
|
162 |
+
'token_str': ' fashion'},
|
163 |
+
{'score': 0.03018997237086296,
|
164 |
+
'sequence': "Hello I'm a role model.",
|
165 |
+
'token': 774,
|
166 |
+
'token_str': ' role'},
|
167 |
+
{'score': 0.02111748233437538,
|
168 |
+
'sequence': "Hello I'm a Playboy model.",
|
169 |
+
'token': 24526,
|
170 |
+
'token_str': ' Playboy'}]
|
171 |
+
```
|
172 |
+
|
173 |
+
<a href="https://huggingface.co/exbert/?model=distilroberta-base">
|
174 |
+
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
|
175 |
+
</a>
|
tests/cards/emilyalsentzer___Bio_ClinicalBERT.md
ADDED
@@ -0,0 +1,37 @@
1 |
+
# ClinicalBERT - Bio + Clinical BERT Model
|
2 |
+
|
3 |
+
The [Publicly Available Clinical BERT Embeddings](https://arxiv.org/abs/1904.03323) paper contains four unique clinicalBERT models: initialized with BERT-Base (`cased_L-12_H-768_A-12`) or BioBERT (`BioBERT-Base v1.0 + PubMed 200K + PMC 270K`) & trained on either all MIMIC notes or only discharge summaries.
|
4 |
+
|
5 |
+
This model card describes the Bio+Clinical BERT model, which was initialized from [BioBERT](https://arxiv.org/abs/1901.08746) & trained on all MIMIC notes.
|
6 |
+
|
7 |
+
## Pretraining Data
|
8 |
+
The `Bio_ClinicalBERT` model was trained on all notes from [MIMIC III](https://www.nature.com/articles/sdata201635), a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC, see [here](https://mimic.physionet.org/). All notes from the `NOTEEVENTS` table were included (~880M words).
|
9 |
+
|
10 |
+
## Model Pretraining
|
11 |
+
|
12 |
+
### Note Preprocessing
|
13 |
+
Each note in MIMIC was first split into sections using a rules-based section splitter (e.g. discharge summary notes were split into "History of Present Illness", "Family History", "Brief Hospital Course", etc. sections). Then each section was split into sentences using SciSpacy (`en_core_sci_md` tokenizer).
|
14 |
+
|
15 |
+
### Pretraining Procedures
|
16 |
+
The model was trained using code from [Google's BERT repository](https://github.com/google-research/bert) on a GeForce GTX TITAN X 12 GB GPU. Model parameters were initialized with BioBERT (`BioBERT-Base v1.0 + PubMed 200K + PMC 270K`).
|
17 |
+
|
18 |
+
### Pretraining Hyperparameters
|
19 |
+
We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5 · 10⁻⁵ for pre-training our models. The models trained on all MIMIC notes were trained for 150,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15
|
20 |
+
and max predictions per sequence = 20).
|
21 |
+
|
22 |
+
## How to use the model
|
23 |
+
|
24 |
+
Load the model via the transformers library:
|
25 |
+
```
|
26 |
+
from transformers import AutoTokenizer, AutoModel
|
27 |
+
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
|
28 |
+
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
|
29 |
+
```
|
30 |
+
|
31 |
+
## More Information
|
32 |
+
|
33 |
+
Refer to the original paper, [Publicly Available Clinical BERT Embeddings](https://arxiv.org/abs/1904.03323) (NAACL Clinical NLP Workshop 2019) for additional details and performance on NLI and NER tasks.
|
34 |
+
|
35 |
+
## Questions?
|
36 |
+
|
37 |
+
Post a GitHub issue on the [clinicalBERT repo](https://github.com/EmilyAlsentzer/clinicalBERT) or email [email protected] with any questions.
|
tests/cards/facebook___bart-large-mnli.md
ADDED
@@ -0,0 +1,73 @@
|
1 |
+
# bart-large-mnli
|
2 |
+
|
3 |
+
This is the checkpoint for [bart-large](https://huggingface.co/facebook/bart-large) after being trained on the [MultiNLI (MNLI)](https://huggingface.co/datasets/multi_nli) dataset.
|
4 |
+
|
5 |
+
Additional information about this model:
|
6 |
+
- The [bart-large](https://huggingface.co/facebook/bart-large) model page
|
7 |
+
- [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
|
8 |
+
](https://arxiv.org/abs/1910.13461)
|
9 |
+
- [BART fairseq implementation](https://github.com/pytorch/fairseq/tree/master/fairseq/models/bart)
|
10 |
+
|
11 |
+
## NLI-based Zero Shot Text Classification
|
12 |
+
|
13 |
+
[Yin et al.](https://arxiv.org/abs/1909.00161) proposed a method for using pre-trained NLI models as ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label. For example, if we want to evaluate whether a sequence belongs to the class "politics", we could construct a hypothesis of `This text is about politics.`. The probabilities for entailment and contradiction are then converted to label probabilities.
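The two steps described above, building one hypothesis per label and turning the entailment and contradiction logits into a label probability, can be sketched without any model. The helper names and the logit values here are hypothetical and only illustrate the arithmetic.

```python
import math


def build_hypotheses(labels, template="This text is about {}."):
    """Turn each candidate label into an NLI hypothesis string."""
    return [template.format(label) for label in labels]


def entailment_probability(contradiction_logit, entailment_logit):
    """Softmax over the two logits; returns the entailment share."""
    top = max(contradiction_logit, entailment_logit)
    exp_contradiction = math.exp(contradiction_logit - top)
    exp_entailment = math.exp(entailment_logit - top)
    return exp_entailment / (exp_contradiction + exp_entailment)


hypotheses = build_hypotheses(["politics", "sports"])
print(hypotheses)  # ['This text is about politics.', 'This text is about sports.']

# Made-up logits for one (premise, hypothesis) pair.
print(round(entailment_probability(-1.0, 3.0), 3))  # 0.982
```

The "neutral" logit is simply left out of the softmax, so the returned value reads as "how much more entailment than contradiction".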
|
14 |
+
|
15 |
+
This method is surprisingly effective in many cases, particularly when used with larger pre-trained models like BART and RoBERTa. See [this blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html) for a more expansive introduction to this and other zero-shot methods, and see the code snippets below for examples of using this model for zero-shot classification both with Hugging Face's built-in pipeline and with native Transformers/PyTorch code.
|
16 |
+
|
17 |
+
#### With the zero-shot classification pipeline
|
18 |
+
|
19 |
+
The model can be loaded with the `zero-shot-classification` pipeline like so:
|
20 |
+
|
21 |
+
```python
|
22 |
+
from transformers import pipeline
|
23 |
+
classifier = pipeline("zero-shot-classification",
|
24 |
+
model="facebook/bart-large-mnli")
|
25 |
+
```
|
26 |
+
|
27 |
+
You can then use this pipeline to classify sequences into any of the class names you specify.
|
28 |
+
|
29 |
+
```python
|
30 |
+
sequence_to_classify = "one day I will see the world"
|
31 |
+
candidate_labels = ['travel', 'cooking', 'dancing']
|
32 |
+
classifier(sequence_to_classify, candidate_labels)
|
33 |
+
#{'labels': ['travel', 'dancing', 'cooking'],
|
34 |
+
# 'scores': [0.9938651323318481, 0.0032737774308770895, 0.002861034357920289],
|
35 |
+
# 'sequence': 'one day I will see the world'}
|
36 |
+
```
|
37 |
+
|
38 |
+
If more than one candidate label can be correct, pass `multi_label=True` (named `multi_class` in older versions of `transformers`) to calculate each class independently:
|
39 |
+
|
40 |
+
```python
|
41 |
+
candidate_labels = ['travel', 'cooking', 'dancing', 'exploration']
|
42 |
+
classifier(sequence_to_classify, candidate_labels, multi_label=True)
|
43 |
+
#{'labels': ['travel', 'exploration', 'dancing', 'cooking'],
|
44 |
+
# 'scores': [0.9945111274719238,
|
45 |
+
# 0.9383890628814697,
|
46 |
+
# 0.0057061901316046715,
|
47 |
+
# 0.0018193122232332826],
|
48 |
+
# 'sequence': 'one day I will see the world'}
|
49 |
+
```
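The difference between the two scoring modes shown above comes down to where the softmax is applied. The sketch below reflects our understanding of the pipeline's behaviour rather than its actual code, and uses made-up logits; `single_label_scores` and `independent_scores` are hypothetical names.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(entailment_logits):
    """Labels compete: softmax across the entailment logits of all labels."""
    return softmax(entailment_logits)


def independent_scores(contradiction_logits, entailment_logits):
    """Each label is scored alone: entailment vs. contradiction for that label."""
    return [
        softmax([contradiction, entailment])[1]
        for contradiction, entailment in zip(contradiction_logits, entailment_logits)
    ]


# Made-up logits for three candidate labels.
entailment = [4.0, 3.5, -2.0]
contradiction = [-3.0, -2.5, 3.0]

print(single_label_scores(entailment))                # sums to 1 across labels
print(independent_scores(contradiction, entailment))  # each value stands alone
```

In the first mode two plausible labels split the probability mass between them; in the second both can score close to 1, as `travel` and `exploration` do in the example output above.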
|
50 |
+
|
51 |
+
|
52 |
+
#### With manual PyTorch
|
53 |
+
|
54 |
+
```python
|
55 |
+
# pose sequence as a NLI premise and label as a hypothesis
|
56 |
+
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
57 |
+
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
|
58 |
+
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
|
59 |
+
|
60 |
+
premise = sequence
|
61 |
+
hypothesis = f'This example is {label}.'
|
62 |
+
|
63 |
+
# run through model pre-trained on MNLI
|
64 |
+
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
|
65 |
+
                     truncation='only_first')
|
66 |
+
logits = nli_model(x.to(device))[0]
|
67 |
+
|
68 |
+
# we throw away "neutral" (dim 1) and take the probability of
|
69 |
+
# "entailment" (2) as the probability of the label being true
|
70 |
+
entail_contradiction_logits = logits[:,[0,2]]
|
71 |
+
probs = entail_contradiction_logits.softmax(dim=1)
|
72 |
+
prob_label_is_true = probs[:,1]
|
73 |
+
```
|
tests/cards/google___electra-base-discriminator.md
ADDED
@@ -0,0 +1,29 @@
|
1 |
+
## ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
|
2 |
+
|
3 |
+
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
|
4 |
+
|
5 |
+
For a detailed description and experimental results, please refer to our paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).
|
6 |
+
|
7 |
+
This repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g., [GLUE](https://gluebenchmark.com/)), QA tasks (e.g., [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)), and sequence tagging tasks (e.g., [text chunking](https://www.clips.uantwerpen.be/conll2000/chunking/)).
|
8 |
+
|
9 |
+
## How to use the discriminator in `transformers`
|
10 |
+
|
11 |
+
```python
|
12 |
+
from transformers import ElectraForPreTraining, ElectraTokenizerFast
|
13 |
+
import torch
|
14 |
+
|
15 |
+
discriminator = ElectraForPreTraining.from_pretrained("google/electra-base-discriminator")
|
16 |
+
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-base-discriminator")
|
17 |
+
|
18 |
+
sentence = "The quick brown fox jumps over the lazy dog"
|
19 |
+
fake_sentence = "The quick brown fox fake over the lazy dog"
|
20 |
+
|
21 |
+
fake_tokens = tokenizer.tokenize(fake_sentence)
|
22 |
+
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
|
23 |
+
discriminator_outputs = discriminator(fake_inputs)
|
24 |
+
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
|
25 |
+
|
26 |
+
[print("%7s" % token, end="") for token in fake_tokens]
|
27 |
+
|
28 |
+
[print("%7s" % int(prediction), end="") for prediction in predictions.squeeze().tolist()]
|
29 |
+
```
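The `torch.round((torch.sign(...) + 1) / 2)` expression in the snippet above maps each discriminator logit to 0 or 1: a positive logit becomes 1 (the token is predicted to be a replacement) and anything else becomes 0. A plain-Python sketch of the same thresholding, with made-up logits and a hypothetical helper name, looks like this:

```python
def flag_replaced_tokens(logits):
    """Return 1 where the logit is positive (predicted replaced), else 0."""
    return [1 if logit > 0 else 0 for logit in logits]


# Made-up tokens and logits, for illustration only.
tokens = ["the", "quick", "fake", "dog"]
logits = [-3.2, -1.1, 2.7, -0.4]

for token, flag in zip(tokens, flag_replaced_tokens(logits)):
    print(f"{token:>7}{flag:>7}")
```

A logit of exactly zero maps to 0 here; as far as we can tell that matches the original expression, since `torch.round` rounds 0.5 to the nearest even integer.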
|
tests/cards/gpt2.md
ADDED
@@ -0,0 +1,158 @@
|
1 |
+
# GPT-2
|
2 |
+
|
3 |
+
Test the whole generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large
|
4 |
+
|
5 |
+
Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in
|
6 |
+
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
|
7 |
+
and first released at [this page](https://openai.com/blog/better-language-models/).
|
8 |
+
|
9 |
+
Disclaimer: The team releasing GPT-2 also wrote a
|
10 |
+
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card
|
11 |
+
has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.
|
12 |
+
|
13 |
+
## Model description
|
14 |
+
|
15 |
+
GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This
|
16 |
+
means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots
|
17 |
+
of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely,
|
18 |
+
it was trained to guess the next word in sentences.
|
19 |
+
|
20 |
+
Concretely, inputs are sequences of continuous text of a certain length and the targets are the same sequence,
|
21 |
+
shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the
|
22 |
+
predictions for the token `i` only use the inputs from `1` to `i` but not the future tokens.
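A minimal sketch of the two ideas in this paragraph, the shifted targets and the mask that hides future tokens, is given below. It works on plain lists of token ids; the function names are hypothetical and this is not how the library implements it internally.

```python
def shifted_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the sequence shifted by one."""
    return token_ids[:-1], token_ids[1:]


def causal_mask(length):
    """mask[i][j] is True when position i may look at position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


inputs, targets = shifted_pairs([10, 11, 12, 13])
print(inputs, targets)  # [10, 11, 12] [11, 12, 13]

for row in causal_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```

Each row of the mask corresponds to one prediction position, so the prediction at position `i` never sees tokens to its right.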
|
23 |
+
|
24 |
+
This way, the model learns an inner representation of the English language that can then be used to extract features
|
25 |
+
useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a
|
26 |
+
prompt.
|
27 |
+
|
28 |
+
This is the **smallest** version of GPT-2, with 124M parameters.
|
29 |
+
|
30 |
+
**Related Models:** [GPT-Large](https://huggingface.co/gpt2-large), [GPT-Medium](https://huggingface.co/gpt2-medium) and [GPT-XL](https://huggingface.co/gpt2-xl)
|
31 |
+
|
32 |
+
## Intended uses & limitations
|
33 |
+
|
34 |
+
You can use the raw model for text generation or fine-tune it to a downstream task. See the
|
35 |
+
[model hub](https://huggingface.co/models?filter=gpt2) to look for fine-tuned versions on a task that interests you.
|
36 |
+
|
37 |
+
### How to use
|
38 |
+
|
39 |
+
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we
|
40 |
+
set a seed for reproducibility:
|
41 |
+
|
42 |
+
```python
|
43 |
+
>>> from transformers import pipeline, set_seed
|
44 |
+
>>> generator = pipeline('text-generation', model='gpt2')
|
45 |
+
>>> set_seed(42)
|
46 |
+
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
|
47 |
+
|
48 |
+
[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
|
49 |
+
{'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
|
50 |
+
{'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"},
|
51 |
+
{'generated_text': "Hello, I'm a language model, a system model. I want to know my language so that it might be more interesting, more user-friendly"},
|
52 |
+
{'generated_text': 'Hello, I\'m a language model, not a language model"\n\nThe concept of "no-tricks" comes in handy later with new'}]
|
53 |
+
```
|
54 |
+
|
55 |
+
Here is how to use this model to get the features of a given text in PyTorch:
|
56 |
+
|
57 |
+
```python
|
58 |
+
from transformers import GPT2Tokenizer, GPT2Model
|
59 |
+
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
|
60 |
+
model = GPT2Model.from_pretrained('gpt2')
|
61 |
+
text = "Replace me by any text you'd like."
|
62 |
+
encoded_input = tokenizer(text, return_tensors='pt')
|
63 |
+
output = model(**encoded_input)
|
64 |
+
```
|
65 |
+
|
66 |
+
and in TensorFlow:
|
67 |
+
|
68 |
+
```python
|
69 |
+
from transformers import GPT2Tokenizer, TFGPT2Model
|
70 |
+
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
|
71 |
+
model = TFGPT2Model.from_pretrained('gpt2')
|
72 |
+
text = "Replace me by any text you'd like."
|
73 |
+
encoded_input = tokenizer(text, return_tensors='tf')
|
74 |
+
output = model(encoded_input)
|
75 |
+
```
|
76 |
+
|
77 |
+
### Limitations and bias
|
78 |
+
|
79 |
+
The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
|
80 |
+
unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their
|
81 |
+
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):
|
82 |
+
|
83 |
+
> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases
|
84 |
+
> that require the generated text to be true.
|
85 |
+
>
|
86 |
+
> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
|
87 |
+
> not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a
|
88 |
+
> study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
|
89 |
+
> and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
|
90 |
+
> levels of caution around use cases that are sensitive to biases around human attributes.
|
91 |
+
|
92 |
+
Here's an example of how the model can have biased predictions:
|
93 |
+
|
94 |
+
```python
|
95 |
+
>>> from transformers import pipeline, set_seed
|
96 |
+
>>> generator = pipeline('text-generation', model='gpt2')
|
97 |
+
>>> set_seed(42)
|
98 |
+
>>> generator("The White man worked as a", max_length=10, num_return_sequences=5)
|
99 |
+
|
100 |
+
[{'generated_text': 'The White man worked as a mannequin for'},
|
101 |
+
{'generated_text': 'The White man worked as a maniser of the'},
|
102 |
+
{'generated_text': 'The White man worked as a bus conductor by day'},
|
103 |
+
{'generated_text': 'The White man worked as a plumber at the'},
|
104 |
+
{'generated_text': 'The White man worked as a journalist. He had'}]
|
105 |
+
|
106 |
+
>>> set_seed(42)
|
107 |
+
>>> generator("The Black man worked as a", max_length=10, num_return_sequences=5)
|
108 |
+
|
109 |
+
[{'generated_text': 'The Black man worked as a man at a restaurant'},
|
110 |
+
{'generated_text': 'The Black man worked as a car salesman in a'},
|
111 |
+
{'generated_text': 'The Black man worked as a police sergeant at the'},
|
112 |
+
{'generated_text': 'The Black man worked as a man-eating monster'},
|
113 |
+
{'generated_text': 'The Black man worked as a slave, and was'}]
|
114 |
+
```
|
115 |
+
|
116 |
+
This bias will also affect all fine-tuned versions of this model.
|
117 |
+
|
118 |
+
## Training data
|
119 |
+
|
120 |
+
The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web
|
121 |
+
pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from
|
122 |
+
this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weighs
|
123 |
+
40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText
|
124 |
+
[here](https://github.com/openai/gpt-2/blob/master/domains.txt).
|
125 |
+
|
126 |
+
## Training procedure
|
127 |
+
|
128 |
+
### Preprocessing
|
129 |
+
|
130 |
+
The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a
|
131 |
+
vocabulary size of 50,257. The inputs are sequences of 1024 consecutive tokens.
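To make the numbers above concrete, the sketch below splits a token stream into consecutive blocks of 1024 and shows one common way to account for the vocabulary size. Dropping the trailing partial block is an assumption of this sketch, and `chunk_tokens` is a hypothetical helper, not the preprocessing code actually used.

```python
BLOCK_SIZE = 1024


def chunk_tokens(token_ids, block_size=BLOCK_SIZE):
    """Split a token stream into consecutive full blocks.

    Assumption: a trailing remainder shorter than block_size is dropped.
    """
    usable = len(token_ids) - len(token_ids) % block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


blocks = chunk_tokens(list(range(2500)))
print(len(blocks), len(blocks[0]))  # 2 1024

# The vocabulary size is usually explained as 256 byte-level base tokens,
# 50,000 learned merges and one end-of-text token.
print(256 + 50_000 + 1)  # 50257
```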
|
132 |
+
|
133 |
+
The larger model was trained on 256 cloud TPU v3 cores. The training duration was not disclosed, nor were the exact
|
134 |
+
details of training.
|
135 |
+
|
136 |
+
## Evaluation results
|
137 |
+
|
138 |
+
The model achieves the following results without any fine-tuning (zero-shot):
|
139 |
+
|
140 |
+
| Dataset | LAMBADA | LAMBADA | CBT-CN | CBT-NE | WikiText2 | PTB | enwiki8 | text8 | WikiText103 | 1BW |
|
141 |
+
|:--------:|:-------:|:-------:|:------:|:------:|:---------:|:------:|:-------:|:------:|:-----------:|:-----:|
|
142 |
+
| (metric) | (PPL) | (ACC) | (ACC) | (ACC) | (PPL) | (PPL) | (BPB) | (BPC) | (PPL) | (PPL) |
|
143 |
+
| | 35.13 | 45.99 | 87.65 | 83.4 | 29.41 | 65.85 | 1.16 | 1.17 | 37.50 | 75.20 |
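For readers unfamiliar with the metrics in the table, perplexity (PPL) is the exponential of the average negative log-likelihood per token, and bits per byte or character (BPB, BPC) is the same average expressed in base 2 per byte or character. The sketch below uses made-up log-probabilities and hypothetical function names; it is not the evaluation code behind the reported numbers.

```python
import math


def perplexity(log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))


def bits_per_unit(log_probs):
    """Mean negative log probability per unit, converted from nats to bits."""
    return -sum(log_probs) / (len(log_probs) * math.log(2))


# A model that assigns probability 0.25 to every observed unit.
log_probs = [math.log(0.25)] * 4
print(round(perplexity(log_probs), 6))     # 4.0
print(round(bits_per_unit(log_probs), 6))  # 2.0
```

Lower is better for PPL, BPB and BPC, while the ACC columns are accuracies where higher is better.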
|
144 |
+
|
145 |
+
|
146 |
+
### BibTeX entry and citation info
|
147 |
+
|
148 |
+
```bibtex
|
149 |
+
@article{radford2019language,
|
150 |
+
title={Language Models are Unsupervised Multitask Learners},
|
151 |
+
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
|
152 |
+
year={2019}
|
153 |
+
}
|
154 |
+
```
|
155 |
+
|
156 |
+
<a href="https://huggingface.co/exbert/?model=gpt2">
|
157 |
+
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
|
158 |
+
</a>
|
tests/cards/jonatasgrosman___wav2vec2-large-xlsr-53-english.md
ADDED
@@ -0,0 +1,102 @@
|
1 |
+
# Fine-tuned XLSR-53 large model for speech recognition in English
|
2 |
+
|
3 |
+
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on English using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice).
|
4 |
+
When using this model, make sure that your speech input is sampled at 16kHz.
|
5 |
+
|
6 |
+
This model has been fine-tuned thanks to the GPU credits generously given by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)
|
7 |
+
|
8 |
+
The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
|
9 |
+
|
10 |
+
## Usage
|
11 |
+
|
12 |
+
The model can be used directly (without a language model) as follows...
|
13 |
+
|
14 |
+
Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:
|
15 |
+
|
16 |
+
```python
|
17 |
+
from huggingsound import SpeechRecognitionModel
|
18 |
+
|
19 |
+
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
|
20 |
+
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
|
21 |
+
|
22 |
+
transcriptions = model.transcribe(audio_paths)
|
23 |
+
```
|
24 |
+
|
25 |
+
Writing your own inference script:
|
26 |
+
|
27 |
+
```python
|
28 |
+
import torch
|
29 |
+
import librosa
|
30 |
+
from datasets import load_dataset
|
31 |
+
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
|
32 |
+
|
33 |
+
LANG_ID = "en"
|
34 |
+
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
|
35 |
+
SAMPLES = 10
|
36 |
+
|
37 |
+
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
|
38 |
+
|
39 |
+
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
|
40 |
+
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
|
41 |
+
|
42 |
+
# Preprocessing the datasets.
|
43 |
+
# We need to read the audio files as arrays
|
44 |
+
def speech_file_to_array_fn(batch):
|
45 |
+
speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
|
46 |
+
batch["speech"] = speech_array
|
47 |
+
batch["sentence"] = batch["sentence"].upper()
|
48 |
+
return batch
|
49 |
+
|
50 |
+
test_dataset = test_dataset.map(speech_file_to_array_fn)
|
51 |
+
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
|
52 |
+
|
53 |
+
with torch.no_grad():
|
54 |
+
logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
|
55 |
+
|
56 |
+
predicted_ids = torch.argmax(logits, dim=-1)
|
57 |
+
predicted_sentences = processor.batch_decode(predicted_ids)
|
58 |
+
|
59 |
+
for i, predicted_sentence in enumerate(predicted_sentences):
|
60 |
+
print("-" * 100)
|
61 |
+
print("Reference:", test_dataset[i]["sentence"])
|
62 |
+
print("Prediction:", predicted_sentence)
|
63 |
+
```
|
64 |
+
|
65 |
+
| Reference | Prediction |
|
66 |
+
| ------------- | ------------- |
|
67 |
+
| "SHE'LL BE ALL RIGHT." | SHE'LL BE ALL RIGHT |
|
68 |
+
| SIX | SIX |
|
69 |
+
| "ALL'S WELL THAT ENDS WELL." | ALL AS WELL THAT ENDS WELL |
|
70 |
+
| DO YOU MEAN IT? | DO YOU MEAN IT |
|
71 |
+
| THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESSION |
|
72 |
+
| HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS MOSLILLAR GOING TO HANDLE ANDBEWOOTH HIS LIKE Q AND Q |
|
73 |
+
| "I GUESS YOU MUST THINK I'M KINDA BATTY." | RUSTIAN WASTIN PAN ONTE BATTLY |
|
74 |
+
| NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING |
|
75 |
+
| SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER |
|
76 |
+
| GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. | GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD |
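The `torch.argmax` and `batch_decode` steps in the inference script amount to greedy CTC decoding: take the most likely id per audio frame, collapse consecutive repeats, then drop the blank symbol. A minimal sketch of that collapsing rule is shown below; the tiny vocabulary and the blank id of 0 are made up for illustration and do not match the model's real vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame ids, then remove blanks and join characters."""
    collapsed = [key for key, _group in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary and per-frame argmax ids.
vocab = {0: "<blank>", 1: "S", 2: "I", 3: "X"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
print(ctc_greedy_decode(frames, vocab))  # SIX
```

The blank symbol is what lets the model emit a doubled letter: the frame ids `[1, 0, 1]` decode to `SS`, whereas `[1, 1]` decodes to a single `S`.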
|
77 |
+
|
78 |
+
## Evaluation
|
79 |
+
|
80 |
+
1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`
|
81 |
+
|
82 |
+
```bash
|
83 |
+
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test
|
84 |
+
```
|
85 |
+
|
86 |
+
2. To evaluate on `speech-recognition-community-v2/dev_data`
|
87 |
+
|
88 |
+
```bash
|
89 |
+
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
|
90 |
+
```
|
91 |
+
|
92 |
+
## Citation
|
93 |
+
If you want to cite this model you can use this:
|
94 |
+
|
95 |
+
```bibtex
|
96 |
+
@misc{grosman2021xlsr53-large-english,
|
97 |
+
title={Fine-tuned {XLSR}-53 large model for speech recognition in {E}nglish},
|
98 |
+
author={Grosman, Jonatas},
|
99 |
+
howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english}},
|
100 |
+
year={2021}
|
101 |
+
}
|
102 |
+
```
|
tests/cards/microsoft___layoutlmv3-base.md
ADDED
@@ -0,0 +1,29 @@
|
1 |
+
# LayoutLMv3
|
2 |
+
|
3 |
+
[Microsoft Document AI](https://www.microsoft.com/en-us/research/project/document-ai/) | [GitHub](https://aka.ms/layoutlmv3)
|
4 |
+
|
5 |
+
## Model description
|
6 |
+
|
7 |
+
LayoutLMv3 is a pre-trained multimodal Transformer for Document AI with unified text and image masking. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model. For example, LayoutLMv3 can be fine-tuned for both text-centric tasks, including form understanding, receipt understanding, and document visual question answering, and image-centric tasks such as document image classification and document layout analysis.
|
8 |
+
|
9 |
+
[LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)
|
10 |
+
Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei, ACM Multimedia 2022.
|
11 |
+
|
12 |
+
## Citation
|
13 |
+
|
14 |
+
If you find LayoutLM useful in your research, please cite the following paper:
|
15 |
+
|
16 |
+
```
|
17 |
+
@inproceedings{huang2022layoutlmv3,
|
18 |
+
author={Yupan Huang and Tengchao Lv and Lei Cui and Yutong Lu and Furu Wei},
|
19 |
+
title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
|
20 |
+
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
|
21 |
+
year={2022}
|
22 |
+
}
|
23 |
+
```
|
24 |
+
|
25 |
+
## License
|
26 |
+
|
27 |
+
The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/).
|
28 |
+
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.
|
29 |
+
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
|
tests/cards/openai___clip-vit-base-patch32.md
ADDED
@@ -0,0 +1,136 @@
|
1 |
+
# Model Card: CLIP
|
2 |
+
|
3 |
+
Disclaimer: This model card is taken and modified from the official CLIP repository; the original can be found [here](https://github.com/openai/CLIP/blob/main/model-card.md).
|
4 |
+
|
5 |
+
## Model Details
|
6 |
+
|
7 |
+
The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within.
|
8 |
+
|
9 |
+
### Model Date
|
10 |
+
|
11 |
+
January 2021
|
12 |
+
|
13 |
+
### Model Type
|
14 |
+
|
15 |
+
The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
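The contrastive objective mentioned above can be sketched as a symmetric cross-entropy over a batch similarity matrix, where entry `(i, j)` scores image `i` against text `j` and the matching pairs sit on the diagonal. This follows the pseudocode in the CLIP paper as we read it; temperature scaling is assumed to be already applied to the scores, and the function names are hypothetical.

```python
import math


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(similarity):
    """Average of the image-to-text and text-to-image cross-entropies."""
    transposed = [list(column) for column in zip(*similarity)]
    return (cross_entropy_rows(similarity) + cross_entropy_rows(transposed)) / 2


# Made-up similarity scores for a batch of two (image, text) pairs.
well_matched = [[10.0, 0.0], [0.0, 10.0]]
uninformative = [[0.0, 0.0], [0.0, 0.0]]
print(symmetric_contrastive_loss(well_matched))   # close to 0
print(symmetric_contrastive_loss(uninformative))  # ln(2), about 0.693
```

Minimizing this loss pushes the diagonal entries up relative to the rest of their row and column, which is what "maximizing the similarity of (image, text) pairs" means in practice.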
|
16 |
+
|
17 |
+
The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.
|
18 |
+
|
19 |
+
|
20 |
+
### Documents
|
21 |
+
|
22 |
+
- [Blog Post](https://openai.com/blog/clip/)
|
23 |
+
- [CLIP Paper](https://arxiv.org/abs/2103.00020)
|
24 |
+
|
25 |
+
|
26 |
+
### Use with Transformers
|
27 |
+
|
28 |
+
```python3
|
29 |
+
from PIL import Image
|
30 |
+
import requests
|
31 |
+
|
32 |
+
from transformers import CLIPProcessor, CLIPModel
|
33 |
+
|
34 |
+
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|
35 |
+
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
36 |
+
|
37 |
+
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
38 |
+
image = Image.open(requests.get(url, stream=True).raw)
|
39 |
+
|
40 |
+
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
|
41 |
+
|
42 |
+
outputs = model(**inputs)
|
43 |
+
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
44 |
+
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
45 |
+
```
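The last two lines of the snippet turn image-text similarity scores into label probabilities. A model-free sketch of that step follows, assuming the scores are scaled cosine similarities between embeddings; the embeddings, the scale of 100 and the function names are illustrative assumptions, not values read from the checkpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def label_probabilities(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one score per candidate text."""
    logits = [logit_scale * cosine_similarity(image_embedding, t) for t in text_embeddings]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up 3-dimensional embeddings for one image and two captions.
image = [0.9, 0.1, 0.0]
captions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = label_probabilities(image, captions)
print(probs)  # the first caption receives almost all of the probability mass
```

Because the softmax runs over the candidate captions only, the probabilities are relative to the label set you supply, which is one reason results can shift with the class taxonomy.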
|
46 |
+
|
47 |
+
|
48 |
+
## Model Use
|
49 |
+
|
50 |
+
### Intended Use
|
51 |
+
|
52 |
+
The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.
|
53 |
+
|
54 |
+
#### Primary intended uses
|
55 |
+
|
56 |
+
The primary intended users of these models are AI researchers.
|
57 |
+
|
58 |
+
We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
|
59 |
+
|
60 |
+
### Out-of-Scope Use Cases
|
61 |
+
|
62 |
+
**Any** deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful.
|
63 |
+
|
64 |
+
Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use.
|
65 |
+
|
66 |
+
Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.
|
67 |
+
|
68 |
+
|
69 |
+
|
70 |
+
## Data
|
71 |
+
|
72 |
+
The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users.
|
73 |
+
|
74 |
+
### Data Mission Statement
|
75 |
+
|
76 |
+
Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset.
|
77 |
+
|
78 |
+
|
79 |
+
|
80 |
+
## Performance and Limitations
|
81 |
+
|
82 |
+
### Performance
|
83 |
+
|
84 |
+
We have evaluated the performance of CLIP on a wide range of benchmarks across a variety of computer vision datasets, ranging from OCR to texture recognition to fine-grained classification. The paper describes model performance on the following datasets:
|
85 |
+
|
86 |
+
- Food101
|
87 |
+
- CIFAR10
|
88 |
+
- CIFAR100
|
89 |
+
- Birdsnap
|
90 |
+
- SUN397
|
91 |
+
- Stanford Cars
|
92 |
+
- FGVC Aircraft
|
93 |
+
- VOC2007
|
94 |
+
- DTD
|
95 |
+
- Oxford-IIIT Pet dataset
|
96 |
+
- Caltech101
|
97 |
+
- Flowers102
|
98 |
+
- MNIST
|
99 |
+
- SVHN
|
100 |
+
- IIIT5K
|
101 |
+
- Hateful Memes
|
102 |
+
- SST-2
|
103 |
+
- UCF101
|
104 |
+
- Kinetics700
|
105 |
+
- Country211
|
106 |
+
- CLEVR Counting
|
107 |
+
- KITTI Distance
|
108 |
+
- STL-10
|
109 |
+
- RareAct
|
110 |
+
- Flickr30
|
111 |
+
- MSCOCO
|
112 |
+
- ImageNet
|
113 |
+
- ImageNet-A
|
114 |
+
- ImageNet-R
|
115 |
+
- ImageNet Sketch
|
116 |
+
- ObjectNet (ImageNet Overlap)
|
117 |
+
- Youtube-BB
|
118 |
+
- ImageNet-Vid
|
119 |
+
|
120 |
+
### Limitations
|
121 |
+
|
122 |
+
CLIP and our analysis of it have a number of limitations. CLIP currently struggles with certain tasks such as fine-grained classification and counting objects. CLIP also raises issues with regard to fairness and bias, which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP has an important limitation: in many cases we have used linear probes to evaluate the performance of CLIP, and there is evidence suggesting that linear probes can underestimate model performance.
|
123 |
+
|
124 |
+
### Bias and Fairness
|
125 |
+
|
126 |
+
We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from [FairFace](https://arxiv.org/abs/1908.04913) into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details are captured in the Broader Impacts section of the paper.)
|
127 |
+
|
128 |
+
We also tested the performance of CLIP on gender, race and age classification using the FairFace dataset (we default to the race categories as they are constructed in the FairFace dataset) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification, with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. We use these evaluations of gender, race and age classification, as well as of denigration harms, simply to evaluate the performance of the model across people and to surface potential risks, not to demonstrate an endorsement of or enthusiasm for such tasks.
|
129 |
+
|
130 |
+
|
131 |
+
|
132 |
+
## Feedback
|
133 |
+
|
134 |
+
### Where to send questions or comments about the model
|
135 |
+
|
136 |
+
Please use [this Google Form](https://forms.gle/Uv7afRH5dvY34ZEs9)
|
tests/cards/openai___clip-vit-large-patch14.md
ADDED
@@ -0,0 +1,136 @@
|
1 |
+
# Model Card: CLIP
|
2 |
+
|
3 |
+
Disclaimer: The model card is taken and modified from the official CLIP repository; it can be found [here](https://github.com/openai/CLIP/blob/main/model-card.md).
|
4 |
+
|
5 |
+
## Model Details
|
6 |
+
|
7 |
+
The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within.
|
8 |
+
|
9 |
+
### Model Date
|
10 |
+
|
11 |
+
January 2021
|
12 |
+
|
13 |
+
### Model Type
|
14 |
+
|
15 |
+
The base model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
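The contrastive objective mentioned above can be illustrated with a small standard-library sketch. It assumes the usual symmetric cross-entropy formulation, in which matching (image, text) pairs sit on the diagonal of a similarity matrix. The function names and the toy matrices are illustrative only; this is not OpenAI's training code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def symmetric_contrastive_loss(logits):
    """Average of image-to-text and text-to-image cross-entropy.

    logits[i][j] is the similarity of image i and text j; the matching
    pair for image i is assumed to be text i (the diagonal).
    """
    n = len(logits)
    # Image-to-text: each row should put its mass on the diagonal entry.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


aligned = [[5.0, 0.0], [0.0, 5.0]]     # matching pairs score highest
misaligned = [[0.0, 5.0], [5.0, 0.0]]  # matching pairs score lowest
loss_aligned = symmetric_contrastive_loss(aligned)
loss_misaligned = symmetric_contrastive_loss(misaligned)
```

Minimizing this loss pushes the diagonal similarities up relative to the off-diagonal ones, which is what "maximize the similarity of (image, text) pairs" amounts to in this toy setting.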
|
16 |
+
|
17 |
+
The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.
|
18 |
+
|
19 |
+
|
20 |
+
### Documents
|
21 |
+
|
22 |
+
- [Blog Post](https://openai.com/blog/clip/)
|
23 |
+
- [CLIP Paper](https://arxiv.org/abs/2103.00020)
|
24 |
+
|
25 |
+
|
26 |
+
### Use with Transformers
|
27 |
+
|
28 |
+
```python
|
29 |
+
from PIL import Image
|
30 |
+
import requests
|
31 |
+
|
32 |
+
from transformers import CLIPProcessor, CLIPModel
|
33 |
+
|
34 |
+
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
|
35 |
+
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
|
36 |
+
|
37 |
+
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
38 |
+
image = Image.open(requests.get(url, stream=True).raw)
|
39 |
+
|
40 |
+
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
|
41 |
+
|
42 |
+
outputs = model(**inputs)
|
43 |
+
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
44 |
+
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
45 |
+
```
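The last step of the snippet above turns similarity logits into label probabilities. A minimal standard-library sketch of that step is shown below; the logit values are made up for illustration and are not real model output, and `softmax` here is a hand-written helper rather than the tensor method used above.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a cat", "a photo of a dog"]
example_logits = [24.0, 19.0]  # made-up values, not real model output
probs = softmax(example_logits)
best_label = labels[probs.index(max(probs))]
```

Because the softmax is taken over the candidate texts, the probabilities only compare the prompts that were supplied; they say nothing about labels that were not in the list.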
|
46 |
+
|
47 |
+
|
48 |
+
## Model Use
|
49 |
+
|
50 |
+
### Intended Use
|
51 |
+
|
52 |
+
The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.
|
53 |
+
|
54 |
+
#### Primary intended uses
|
55 |
+
|
56 |
+
The primary intended users of these models are AI researchers.
|
57 |
+
|
58 |
+
We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
|
59 |
+
|
60 |
+
### Out-of-Scope Use Cases
|
61 |
+
|
62 |
+
**Any** deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful.
|
63 |
+
|
64 |
+
Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use.
|
65 |
+
|
66 |
+
Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.
|
67 |
+
|
68 |
+
|
69 |
+
|
70 |
+
## Data
|
71 |
+
|
72 |
+
The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users.
|
73 |
+
|
74 |
+
### Data Mission Statement
|
75 |
+
|
76 |
+
Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset.
|
77 |
+
|
78 |
+
|
79 |
+
|
80 |
+
## Performance and Limitations
|
81 |
+
|
82 |
+
### Performance
|
83 |
+
|
84 |
+
We have evaluated the performance of CLIP on a wide range of benchmarks across a variety of computer vision datasets, ranging from OCR to texture recognition to fine-grained classification. The paper describes model performance on the following datasets:
|
85 |
+
|
86 |
+
- Food101
|
87 |
+
- CIFAR10
|
88 |
+
- CIFAR100
|
89 |
+
- Birdsnap
|
90 |
+
- SUN397
|
91 |
+
- Stanford Cars
|
92 |
+
- FGVC Aircraft
|
93 |
+
- VOC2007
|
94 |
+
- DTD
|
95 |
+
- Oxford-IIIT Pet dataset
|
96 |
+
- Caltech101
|
97 |
+
- Flowers102
|
98 |
+
- MNIST
|
99 |
+
- SVHN
|
100 |
+
- IIIT5K
|
101 |
+
- Hateful Memes
|
102 |
+
- SST-2
|
103 |
+
- UCF101
|
104 |
+
- Kinetics700
|
105 |
+
- Country211
|
106 |
+
- CLEVR Counting
|
107 |
+
- KITTI Distance
|
108 |
+
- STL-10
|
109 |
+
- RareAct
|
110 |
+
- Flickr30
|
111 |
+
- MSCOCO
|
112 |
+
- ImageNet
|
113 |
+
- ImageNet-A
|
114 |
+
- ImageNet-R
|
115 |
+
- ImageNet Sketch
|
116 |
+
- ObjectNet (ImageNet Overlap)
|
117 |
+
- Youtube-BB
|
118 |
+
- ImageNet-Vid
|
119 |
+
|
120 |
+
### Limitations
|
121 |
+
|
122 |
+
CLIP and our analysis of it have a number of limitations. CLIP currently struggles with certain tasks such as fine-grained classification and counting objects. CLIP also raises issues with regard to fairness and bias, which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP has an important limitation: in many cases we have used linear probes to evaluate the performance of CLIP, and there is evidence suggesting that linear probes can underestimate model performance.
|
123 |
+
|
124 |
+
### Bias and Fairness
|
125 |
+
|
126 |
+
We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from [FairFace](https://arxiv.org/abs/1908.04913) into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details are captured in the Broader Impacts section of the paper.)
|
127 |
+
|
128 |
+
We also tested the performance of CLIP on gender, race and age classification using the FairFace dataset (we default to the race categories as they are constructed in the FairFace dataset) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification, with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. We use these evaluations of gender, race and age classification, as well as of denigration harms, simply to evaluate the performance of the model across people and to surface potential risks, not to demonstrate an endorsement of or enthusiasm for such tasks.
|
129 |
+
|
130 |
+
|
131 |
+
|
132 |
+
## Feedback
|
133 |
+
|
134 |
+
### Where to send questions or comments about the model
|
135 |
+
|
136 |
+
Please use [this Google Form](https://forms.gle/Uv7afRH5dvY34ZEs9)
|
tests/cards/philschmid___bart-large-cnn-samsum.md
ADDED
@@ -0,0 +1,62 @@
|
1 |
+
## `bart-large-cnn-samsum`
|
2 |
+
|
3 |
+
> If you want to use the model, you should try a newer fine-tuned FLAN-T5 version, [philschmid/flan-t5-base-samsum](https://huggingface.co/philschmid/flan-t5-base-samsum), which outscores the BART version by `+6` on `ROUGE1`, achieving `47.24`.
|
4 |
+
|
5 |
+
# TRY [philschmid/flan-t5-base-samsum](https://huggingface.co/philschmid/flan-t5-base-samsum)
|
6 |
+
|
7 |
+
|
8 |
+
This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container.
|
9 |
+
|
10 |
+
For more information look at:
|
11 |
+
- [🤗 Transformers Documentation: Amazon SageMaker](https://huggingface.co/transformers/sagemaker.html)
|
12 |
+
- [Example Notebooks](https://github.com/huggingface/notebooks/tree/master/sagemaker)
|
13 |
+
- [Amazon SageMaker documentation for Hugging Face](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html)
|
14 |
+
- [Python SDK SageMaker documentation for Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html)
|
15 |
+
- [Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers)
|
16 |
+
|
17 |
+
## Hyperparameters
|
18 |
+
```json
|
19 |
+
{
|
20 |
+
"dataset_name": "samsum",
|
21 |
+
"do_eval": true,
|
22 |
+
"do_predict": true,
|
23 |
+
"do_train": true,
|
24 |
+
"fp16": true,
|
25 |
+
"learning_rate": 5e-05,
|
26 |
+
"model_name_or_path": "facebook/bart-large-cnn",
|
27 |
+
"num_train_epochs": 3,
|
28 |
+
"output_dir": "/opt/ml/model",
|
29 |
+
"per_device_eval_batch_size": 4,
|
30 |
+
"per_device_train_batch_size": 4,
|
31 |
+
"predict_with_generate": true,
|
32 |
+
"seed": 7
|
33 |
+
}
|
34 |
+
```
|
35 |
+
|
36 |
+
## Usage
|
37 |
+
```python
|
38 |
+
from transformers import pipeline
|
39 |
+
summarizer = pipeline("summarization", model="philschmid/bart-large-cnn-samsum")
|
40 |
+
|
41 |
+
conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker?
|
42 |
+
Philipp: Sure you can use the new Hugging Face Deep Learning Container.
|
43 |
+
Jeff: ok.
|
44 |
+
Jeff: and how can I get started?
|
45 |
+
Jeff: where can I find documentation?
|
46 |
+
Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face
|
47 |
+
'''
|
48 |
+
summarizer(conversation)
|
49 |
+
```
|
50 |
+
|
51 |
+
## Results
|
52 |
+
|
53 |
+
| key | value |
|
54 |
+
| --- | ----- |
|
55 |
+
| eval_rouge1 | 42.621 |
|
56 |
+
| eval_rouge2 | 21.9825 |
|
57 |
+
| eval_rougeL | 33.034 |
|
58 |
+
| eval_rougeLsum | 39.6783 |
|
59 |
+
| test_rouge1 | 41.3174 |
|
60 |
+
| test_rouge2 | 20.8716 |
|
61 |
+
| test_rougeL | 32.1337 |
|
62 |
+
| test_rougeLsum | 38.4149 |
|
tests/cards/prajjwal1___bert-tiny.md
ADDED
@@ -0,0 +1,46 @@
|
1 |
+
The following model is a PyTorch pre-trained model obtained by converting the TensorFlow checkpoint found in the [official Google BERT repository](https://github.com/google-research/bert).
|
2 |
+
|
3 |
+
This is one of the smaller pre-trained BERT variants, together with [bert-mini](https://huggingface.co/prajjwal1/bert-mini), [bert-small](https://huggingface.co/prajjwal1/bert-small) and [bert-medium](https://huggingface.co/prajjwal1/bert-medium). They were introduced in the study `Well-Read Students Learn Better: On the Importance of Pre-training Compact Models` ([arXiv](https://arxiv.org/abs/1908.08962)), and ported to HF for the study `Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics` ([arXiv](https://arxiv.org/abs/2110.01518)). These models are meant to be fine-tuned on a downstream task.
|
4 |
+
|
5 |
+
If you use the model, please consider citing both papers:
|
6 |
+
```
|
7 |
+
@misc{bhargava2021generalization,
|
8 |
+
title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
|
9 |
+
author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
|
10 |
+
year={2021},
|
11 |
+
eprint={2110.01518},
|
12 |
+
archivePrefix={arXiv},
|
13 |
+
primaryClass={cs.CL}
|
14 |
+
}
|
15 |
+
|
16 |
+
@article{DBLP:journals/corr/abs-1908-08962,
|
17 |
+
author = {Iulia Turc and
|
18 |
+
Ming{-}Wei Chang and
|
19 |
+
Kenton Lee and
|
20 |
+
Kristina Toutanova},
|
21 |
+
title = {Well-Read Students Learn Better: The Impact of Student Initialization
|
22 |
+
on Knowledge Distillation},
|
23 |
+
journal = {CoRR},
|
24 |
+
volume = {abs/1908.08962},
|
25 |
+
year = {2019},
|
26 |
+
url = {http://arxiv.org/abs/1908.08962},
|
27 |
+
eprinttype = {arXiv},
|
28 |
+
eprint = {1908.08962},
|
29 |
+
timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
|
30 |
+
biburl = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
|
31 |
+
bibsource = {dblp computer science bibliography, https://dblp.org}
|
32 |
+
}
|
33 |
+
|
34 |
+
```
|
35 |
+
Config of this model:
|
36 |
+
- `prajjwal1/bert-tiny` (L=2, H=128) [Model Link](https://huggingface.co/prajjwal1/bert-tiny)
|
37 |
+
|
38 |
+
|
39 |
+
Other models to check out:
|
40 |
+
- `prajjwal1/bert-mini` (L=4, H=256) [Model Link](https://huggingface.co/prajjwal1/bert-mini)
|
41 |
+
- `prajjwal1/bert-small` (L=4, H=512) [Model Link](https://huggingface.co/prajjwal1/bert-small)
|
42 |
+
- `prajjwal1/bert-medium` (L=8, H=512) [Model Link](https://huggingface.co/prajjwal1/bert-medium)
|
43 |
+
|
44 |
+
The original implementation and more information can be found in [this GitHub repository](https://github.com/prajjwal1/generalize_lm_nli).
|
45 |
+
|
46 |
+
Twitter: [@prajjwal_1](https://twitter.com/prajjwal_1)
|
tests/cards/roberta-base.md
ADDED
@@ -0,0 +1,224 @@
|
1 |
+
# RoBERTa base model
|
2 |
+
|
3 |
+
Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in
|
4 |
+
[this paper](https://arxiv.org/abs/1907.11692) and first released in
|
5 |
+
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
|
6 |
+
makes a difference between english and English.
|
7 |
+
|
8 |
+
Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by
|
9 |
+
the Hugging Face team.
|
10 |
+
|
11 |
+
## Model description
|
12 |
+
|
13 |
+
RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
|
14 |
+
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
|
15 |
+
publicly available data) with an automatic process to generate inputs and labels from those texts.
|
16 |
+
|
17 |
+
More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model
|
18 |
+
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
|
19 |
+
the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one
|
20 |
+
after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to
|
21 |
+
learn a bidirectional representation of the sentence.
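The difference between the two attention patterns described above can be sketched with a small helper. This is an illustration of the idea only, not code from RoBERTa or GPT; the function name `visible_positions` is hypothetical.

```python
def visible_positions(length, causal):
    """For each position, list the positions it is allowed to attend to."""
    if causal:
        # Autoregressive: position i sees itself and everything before it.
        return [list(range(i + 1)) for i in range(length)]
    # Bidirectional (MLM-style): every position sees the whole sequence.
    return [list(range(length)) for _ in range(length)]


causal_view = visible_positions(3, causal=True)
bidirectional_view = visible_positions(3, causal=False)
```

In the causal case the first token never sees what follows it, whereas in the bidirectional case a masked token can be predicted from context on both sides.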
|
22 |
+
|
23 |
+
This way, the model learns an inner representation of the English language that can then be used to extract features
|
24 |
+
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
|
25 |
+
classifier using the features produced by the RoBERTa model as inputs.
|
26 |
+
|
27 |
+
## Intended uses & limitations
|
28 |
+
|
29 |
+
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
|
30 |
+
See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that
|
31 |
+
interests you.
|
32 |
+
|
33 |
+
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
|
34 |
+
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
|
35 |
+
generation you should look at a model like GPT2.
|
36 |
+
|
37 |
+
### How to use
|
38 |
+
|
39 |
+
You can use this model directly with a pipeline for masked language modeling:
|
40 |
+
|
41 |
+
```python
|
42 |
+
>>> from transformers import pipeline
|
43 |
+
>>> unmasker = pipeline('fill-mask', model='roberta-base')
|
44 |
+
>>> unmasker("Hello I'm a <mask> model.")
|
45 |
+
|
46 |
+
[{'sequence': "<s>Hello I'm a male model.</s>",
|
47 |
+
'score': 0.3306540250778198,
|
48 |
+
'token': 2943,
|
49 |
+
'token_str': 'Ġmale'},
|
50 |
+
{'sequence': "<s>Hello I'm a female model.</s>",
|
51 |
+
'score': 0.04655390977859497,
|
52 |
+
'token': 2182,
|
53 |
+
'token_str': 'Ġfemale'},
|
54 |
+
{'sequence': "<s>Hello I'm a professional model.</s>",
|
55 |
+
'score': 0.04232972860336304,
|
56 |
+
'token': 2038,
|
57 |
+
'token_str': 'Ġprofessional'},
|
58 |
+
{'sequence': "<s>Hello I'm a fashion model.</s>",
|
59 |
+
'score': 0.037216778844594955,
|
60 |
+
'token': 2734,
|
61 |
+
'token_str': 'Ġfashion'},
|
62 |
+
{'sequence': "<s>Hello I'm a Russian model.</s>",
|
63 |
+
'score': 0.03253649175167084,
|
64 |
+
'token': 1083,
|
65 |
+
'token_str': 'ĠRussian'}]
|
66 |
+
```
|
67 |
+
|
68 |
+
Here is how to use this model to get the features of a given text in PyTorch:
|
69 |
+
|
70 |
+
```python
|
71 |
+
from transformers import RobertaTokenizer, RobertaModel
|
72 |
+
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
|
73 |
+
model = RobertaModel.from_pretrained('roberta-base')
|
74 |
+
text = "Replace me by any text you'd like."
|
75 |
+
encoded_input = tokenizer(text, return_tensors='pt')
|
76 |
+
output = model(**encoded_input)
|
77 |
+
```
|
78 |
+
|
79 |
+
and in TensorFlow:
|
80 |
+
|
81 |
+
```python
|
82 |
+
from transformers import RobertaTokenizer, TFRobertaModel
|
83 |
+
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
|
84 |
+
model = TFRobertaModel.from_pretrained('roberta-base')
|
85 |
+
text = "Replace me by any text you'd like."
|
86 |
+
encoded_input = tokenizer(text, return_tensors='tf')
|
87 |
+
output = model(encoded_input)
|
88 |
+
```
|
89 |
+
|
90 |
+
### Limitations and bias
|
91 |
+
|
92 |
+
The training data used for this model contains a lot of unfiltered content from the internet, which is far from
|
93 |
+
neutral. Therefore, the model can have biased predictions:
|
94 |
+
|
95 |
+
```python
|
96 |
+
>>> from transformers import pipeline
|
97 |
+
>>> unmasker = pipeline('fill-mask', model='roberta-base')
|
98 |
+
>>> unmasker("The man worked as a <mask>.")
|
99 |
+
|
100 |
+
[{'sequence': '<s>The man worked as a mechanic.</s>',
|
101 |
+
'score': 0.08702439814805984,
|
102 |
+
'token': 25682,
|
103 |
+
'token_str': 'Ġmechanic'},
|
104 |
+
{'sequence': '<s>The man worked as a waiter.</s>',
|
105 |
+
'score': 0.0819653645157814,
|
106 |
+
'token': 38233,
|
107 |
+
'token_str': 'Ġwaiter'},
|
108 |
+
{'sequence': '<s>The man worked as a butcher.</s>',
|
109 |
+
'score': 0.073323555290699,
|
110 |
+
'token': 32364,
|
111 |
+
'token_str': 'Ġbutcher'},
|
112 |
+
{'sequence': '<s>The man worked as a miner.</s>',
|
113 |
+
'score': 0.046322137117385864,
|
114 |
+
'token': 18678,
|
115 |
+
'token_str': 'Ġminer'},
|
116 |
+
{'sequence': '<s>The man worked as a guard.</s>',
|
117 |
+
'score': 0.040150221437215805,
|
118 |
+
'token': 2510,
|
119 |
+
'token_str': 'Ġguard'}]
|
120 |
+
|
121 |
+
>>> unmasker("The Black woman worked as a <mask>.")
|
122 |
+
|
123 |
+
[{'sequence': '<s>The Black woman worked as a waitress.</s>',
|
124 |
+
'score': 0.22177888453006744,
|
125 |
+
'token': 35698,
|
126 |
+
'token_str': 'Ġwaitress'},
|
127 |
+
{'sequence': '<s>The Black woman worked as a prostitute.</s>',
|
128 |
+
'score': 0.19288744032382965,
|
129 |
+
'token': 36289,
|
130 |
+
'token_str': 'Ġprostitute'},
|
131 |
+
{'sequence': '<s>The Black woman worked as a maid.</s>',
|
132 |
+
'score': 0.06498628109693527,
|
133 |
+
'token': 29754,
|
134 |
+
'token_str': 'Ġmaid'},
|
135 |
+
{'sequence': '<s>The Black woman worked as a secretary.</s>',
|
136 |
+
'score': 0.05375480651855469,
|
137 |
+
'token': 2971,
|
138 |
+
'token_str': 'Ġsecretary'},
|
139 |
+
{'sequence': '<s>The Black woman worked as a nurse.</s>',
|
140 |
+
'score': 0.05245552211999893,
|
141 |
+
'token': 9008,
|
142 |
+
'token_str': 'Ġnurse'}]
|
143 |
+
```
|
144 |
+
|
145 |
+
This bias will also affect all fine-tuned versions of this model.
|
146 |
+
|
147 |
+
## Training data
|
148 |
+
|
149 |
+
The RoBERTa model was pretrained on the union of five datasets:
|
150 |
+
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
|
151 |
+
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers);
|
152 |
+
- [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 million English news
|
153 |
+
articles crawled between September 2016 and February 2019;
|
154 |
+
- [OpenWebText](https://github.com/jcpeterson/openwebtext), an open-source recreation of the WebText dataset used to
|
155 |
+
train GPT-2;
|
156 |
+
- [Stories](https://arxiv.org/abs/1806.02847), a dataset containing a subset of CommonCrawl data filtered to match the
|
157 |
+
story-like style of Winograd schemas.
|
158 |
+
|
159 |
+
Together, these datasets amount to 160GB of text.
|
160 |
+
|
161 |
+
## Training procedure
|
162 |
+
|
163 |
+
### Preprocessing
|
164 |
+
|
165 |
+
The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of
|
166 |
+
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
|
167 |
+
with `<s>` and the end of one by `</s>`.
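The packing described above can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are plain strings, the hypothetical helper `pack_documents` keeps a trailing partial block, and the block size is shrunk from 512 to 4 so the result is easy to read.

```python
def pack_documents(documents, block_size=512, bos="<s>", eos="</s>"):
    """Concatenate tokenized documents and cut the stream into blocks.

    documents: list of token lists. Each document is wrapped in bos/eos
    markers, so a block can contain the end of one document and the start
    of the next. A trailing partial block is kept (an assumption).
    """
    stream = []
    for tokens in documents:
        stream.append(bos)
        stream.extend(tokens)
        stream.append(eos)
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


docs = [["a", "b", "c"], ["d", "e"]]
blocks = pack_documents(docs, block_size=4)
```

With these toy inputs the second block starts with the `</s>` of the first document and continues with the `<s>` of the second, which is the "may span over documents" behaviour.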
|
168 |
+
|
169 |
+
The details of the masking procedure for each sentence are the following:
|
170 |
+
- 15% of the tokens are masked.
|
171 |
+
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
|
172 |
+
- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
|
173 |
+
- In the 10% remaining cases, the masked tokens are left as is.
|
174 |
+
|
175 |
+
Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
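The masking rule and its dynamic re-sampling can be sketched with the standard library. This is a toy illustration, not the fairseq implementation: it assumes each token is selected independently with probability 0.15, uses a six-word vocabulary, and the name `mask_tokens` is hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_token="<mask>", select_prob=0.15):
    """Return (corrupted tokens, labels) using the 15% / 80-10-10 rule.

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_token
        elif roll < 0.9:
            # Random replacement, different from the original token.
            corrupted[i] = rng.choice([t for t in vocab if t != token])
        # Otherwise the original token is left unchanged.
    return corrupted, labels


vocab = ["the", "man", "worked", "as", "a", "guard"]
sentence = vocab * 50  # 300 toy tokens
rng = random.Random(0)
epoch_1, labels_1 = mask_tokens(sentence, vocab, rng)
epoch_2, labels_2 = mask_tokens(sentence, vocab, rng)  # fresh draw: dynamic masking
```

Calling the function again for each epoch, rather than masking once and caching the result, is what makes the masking dynamic in this sketch.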
|
176 |
+
|
177 |
+
### Pretraining
|
178 |
+
|
179 |
+
The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The
|
180 |
+
optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
|
181 |
+
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
|
182 |
+
rate after.
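The learning-rate schedule described above can be written as a small function. The peak rate, warmup length and total step count are taken from the paragraph; decaying all the way to zero at the final step is an assumption, and the function name is illustrative.

```python
def learning_rate(step, peak_lr=6e-4, warmup_steps=24_000, total_steps=500_000):
    """Linear warmup to peak_lr, then linear decay (to zero, by assumption)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


halfway_through_warmup = learning_rate(12_000)
at_peak = learning_rate(24_000)
at_end = learning_rate(500_000)
```

Under these assumptions the rate climbs linearly for the first 24,000 steps, peaks at 6e-4, and then falls linearly over the remaining 476,000 steps.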
|
183 |
+
|
184 |
+
## Evaluation results
|
185 |
+
|
186 |
+
When fine-tuned on downstream tasks, this model achieves the following results:
|
187 |
+
|
188 |
+
GLUE test results:
|
189 |
+
|
190 |
+
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|
191 |
+
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|
192 |
+
| | 87.6 | 91.9 | 92.8 | 94.8 | 63.6 | 91.2 | 90.2 | 78.7 |
|
193 |
+
|
194 |
+
|
195 |
+
### BibTeX entry and citation info
|
196 |
+
|
197 |
+
```bibtex
|
198 |
+
@article{DBLP:journals/corr/abs-1907-11692,
|
199 |
+
author = {Yinhan Liu and
|
200 |
+
Myle Ott and
|
201 |
+
Naman Goyal and
|
202 |
+
Jingfei Du and
|
203 |
+
Mandar Joshi and
|
204 |
+
Danqi Chen and
|
205 |
+
Omer Levy and
|
206 |
+
Mike Lewis and
|
207 |
+
Luke Zettlemoyer and
|
208 |
+
Veselin Stoyanov},
|
209 |
+
title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
|
210 |
+
journal = {CoRR},
|
211 |
+
volume = {abs/1907.11692},
|
212 |
+
year = {2019},
|
213 |
+
url = {http://arxiv.org/abs/1907.11692},
|
214 |
+
archivePrefix = {arXiv},
|
215 |
+
eprint = {1907.11692},
|
216 |
+
timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
|
217 |
+
biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
|
218 |
+
bibsource = {dblp computer science bibliography, https://dblp.org}
|
219 |
+
}
|
220 |
+
```
|
221 |
+
|
222 |
+
<a href="https://huggingface.co/exbert/?model=roberta-base">
|
223 |
+
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
|
224 |
+
</a>
|
tests/cards/roberta-large.md
ADDED
@@ -0,0 +1,225 @@
|
1 |
+
# RoBERTa large model
|
2 |
+
|
3 |
+
Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in
|
4 |
+
[this paper](https://arxiv.org/abs/1907.11692) and first released in
|
5 |
+
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
|
6 |
+
makes a difference between english and English.
|
7 |
+
|
8 |
+
Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by
|
9 |
+
the Hugging Face team.
|
10 |
+
|
11 |
+
## Model description
|
12 |
+
|
13 |
+
RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
|
14 |
+
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
|
15 |
+
publicly available data) with an automatic process to generate inputs and labels from those texts.
|
16 |
+
|
17 |
+
More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model
|
18 |
+
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
|
19 |
+
the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one
|
20 |
+
after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to
|
21 |
+
learn a bidirectional representation of the sentence.
|
22 |
+
|
23 |
+
This way, the model learns an inner representation of the English language that can then be used to extract features
|
24 |
+
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
|
25 |
+
classifier using the features produced by the RoBERTa model as inputs.
|
26 |
+
|
27 |
+
## Intended uses & limitations
|
28 |
+
|
29 |
+
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
|
30 |
+
See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that
|
31 |
+
interests you.
|
32 |
+
|
33 |
+
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
|
34 |
+
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
|
35 |
+
generation you should look at a model like GPT2.
|
36 |
+
|
37 |
+
### How to use
|
38 |
+
|
39 |
+
You can use this model directly with a pipeline for masked language modeling:
|
40 |
+
|
41 |
+
```python
|
42 |
+
>>> from transformers import pipeline
|
43 |
+
>>> unmasker = pipeline('fill-mask', model='roberta-large')
|
44 |
+
>>> unmasker("Hello I'm a <mask> model.")
|
45 |
+
|
46 |
+
[{'sequence': "<s>Hello I'm a male model.</s>",
|
47 |
+
'score': 0.3317350447177887,
|
48 |
+
'token': 2943,
|
49 |
+
'token_str': 'Ġmale'},
|
50 |
+
{'sequence': "<s>Hello I'm a fashion model.</s>",
|
51 |
+
'score': 0.14171843230724335,
|
52 |
+
'token': 2734,
|
53 |
+
'token_str': 'Ġfashion'},
|
54 |
+
{'sequence': "<s>Hello I'm a professional model.</s>",
|
55 |
+
'score': 0.04291723668575287,
|
56 |
+
'token': 2038,
|
57 |
+
'token_str': 'Ġprofessional'},
|
58 |
+
{'sequence': "<s>Hello I'm a freelance model.</s>",
|
59 |
+
'score': 0.02134818211197853,
|
60 |
+
'token': 18150,
|
61 |
+
'token_str': 'Ġfreelance'},
|
62 |
+
{'sequence': "<s>Hello I'm a young model.</s>",
|
63 |
+
'score': 0.021098261699080467,
|
64 |
+
'token': 664,
|
65 |
+
'token_str': 'Ġyoung'}]
|
66 |
+
```
|
67 |
+
|
68 |
+
Here is how to use this model to get the features of a given text in PyTorch:
|
69 |
+
|
70 |
+
```python
|
71 |
+
from transformers import RobertaTokenizer, RobertaModel
|
72 |
+
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
|
73 |
+
model = RobertaModel.from_pretrained('roberta-large')
|
74 |
+
text = "Replace me by any text you'd like."
|
75 |
+
encoded_input = tokenizer(text, return_tensors='pt')
|
76 |
+
output = model(**encoded_input)
|
77 |
+
```
|
78 |
+
|
79 |
+
and in TensorFlow:
|
80 |
+
|
81 |
+
```python
|
82 |
+
from transformers import RobertaTokenizer, TFRobertaModel
|
83 |
+
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
|
84 |
+
model = TFRobertaModel.from_pretrained('roberta-large')
|
85 |
+
text = "Replace me by any text you'd like."
|
86 |
+
encoded_input = tokenizer(text, return_tensors='tf')
|
87 |
+
output = model(encoded_input)
|
88 |
+
```
|
89 |
+
|
90 |
+
### Limitations and bias
|
91 |
+
|
92 |
+
The training data used for this model contains a lot of unfiltered content from the internet, which is far from
|
93 |
+
neutral. Therefore, the model can have biased predictions:
|
94 |
+
|
95 |
+
```python
|
96 |
+
>>> from transformers import pipeline
|
97 |
+
>>> unmasker = pipeline('fill-mask', model='roberta-large')
|
98 |
+
>>> unmasker("The man worked as a <mask>.")
|
99 |
+
|
100 |
+
[{'sequence': '<s>The man worked as a mechanic.</s>',
|
101 |
+
'score': 0.08260300755500793,
|
102 |
+
'token': 25682,
|
103 |
+
'token_str': 'Ġmechanic'},
|
104 |
+
{'sequence': '<s>The man worked as a driver.</s>',
|
105 |
+
'score': 0.05736079439520836,
|
106 |
+
'token': 1393,
|
107 |
+
'token_str': 'Ġdriver'},
|
108 |
+
{'sequence': '<s>The man worked as a teacher.</s>',
|
109 |
+
'score': 0.04709019884467125,
|
110 |
+
'token': 3254,
|
111 |
+
'token_str': 'Ġteacher'},
|
112 |
+
{'sequence': '<s>The man worked as a bartender.</s>',
|
113 |
+
'score': 0.04641604796051979,
|
114 |
+
'token': 33080,
|
115 |
+
'token_str': 'Ġbartender'},
|
116 |
+
{'sequence': '<s>The man worked as a waiter.</s>',
|
117 |
+
'score': 0.04239227622747421,
|
118 |
+
'token': 38233,
|
119 |
+
'token_str': 'Ġwaiter'}]
|
120 |
+
|
121 |
+
>>> unmasker("The woman worked as a <mask>.")
|
122 |
+
|
123 |
+
[{'sequence': '<s>The woman worked as a nurse.</s>',
|
124 |
+
'score': 0.2667474150657654,
|
125 |
+
'token': 9008,
|
126 |
+
'token_str': 'Ġnurse'},
|
127 |
+
{'sequence': '<s>The woman worked as a waitress.</s>',
|
128 |
+
'score': 0.12280137836933136,
|
129 |
+
'token': 35698,
|
130 |
+
'token_str': 'Ġwaitress'},
|
131 |
+
{'sequence': '<s>The woman worked as a teacher.</s>',
|
132 |
+
'score': 0.09747499972581863,
|
133 |
+
'token': 3254,
|
134 |
+
'token_str': 'Ġteacher'},
|
135 |
+
{'sequence': '<s>The woman worked as a secretary.</s>',
|
136 |
+
'score': 0.05783602222800255,
|
137 |
+
'token': 2971,
|
138 |
+
'token_str': 'Ġsecretary'},
|
139 |
+
{'sequence': '<s>The woman worked as a cleaner.</s>',
|
140 |
+
'score': 0.05576248839497566,
|
141 |
+
'token': 16126,
|
142 |
+
'token_str': 'Ġcleaner'}]
|
143 |
+
```
|
144 |
+
|
145 |
+
This bias will also affect all fine-tuned versions of this model.
|
146 |
+
|
147 |
+
## Training data
|
148 |
+
|
149 |
+
The RoBERTa model was pretrained on the union of five datasets:
|
150 |
+
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
|
151 |
+
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers);
|
152 |
+
- [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 million English news
|
153 |
+
articles crawled between September 2016 and February 2019;
|
154 |
+
- [OpenWebText](https://github.com/jcpeterson/openwebtext), an open-source recreation of the WebText dataset used to
|
155 |
+
train GPT-2;
|
156 |
+
- [Stories](https://arxiv.org/abs/1806.02847), a dataset containing a subset of CommonCrawl data filtered to match the
|
157 |
+
story-like style of Winograd schemas.
|
158 |
+
|
159 |
+
Together these datasets weigh 160GB of text.
|
160 |
+
|
161 |
+
## Training procedure
|
162 |
+
|
163 |
+
### Preprocessing
|
164 |
+
|
165 |
+
The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of
|
166 |
+
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
|
167 |
+
with `<s>` and the end of one by `</s>`.
|
168 |
+
|
169 |
+
The details of the masking procedure for each sentence are the following:
|
170 |
+
- 15% of the tokens are masked.
|
171 |
+
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
|
172 |
+
|
173 |
+
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
|
174 |
+
- In the 10% remaining cases, the masked tokens are left as is.
|
175 |
+
|
176 |
+
Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
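
The masking rules above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the fairseq implementation: the function name `mask_tokens`, the `mask_id` value and the toy vocabulary size are hypothetical, and the random replacement is drawn uniformly from the rest of the vocabulary.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, rng, mask_prob=0.15):
    """Return (inputs, labels) for one pass over a sequence.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    token_ids = list(token_ids)
    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    for i, original in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # about 85% of positions are not selected
        labels[i] = original
        roll = rng.random()
        if roll < 0.8:
            # 80% of selected positions: replace with the mask token
            inputs[i] = mask_id
        elif roll < 0.9:
            # 10%: replace with a random token that differs from the original
            candidate = rng.randrange(vocab_size - 1)
            inputs[i] = candidate if candidate < original else candidate + 1
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Hypothetical toy setup: ids 0..999 are ordinary tokens, 1000 is the mask id.
rng = random.Random(0)
tokens = [rng.randrange(1000) for _ in range(20)]
inputs, labels = mask_tokens(tokens, mask_id=1000, vocab_size=1000, rng=rng)
print(inputs)
print(labels)
```

Calling `mask_tokens` again on the same sequence draws a fresh mask, which is the sense in which the masking is dynamic rather than fixed once during preprocessing.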
|
177 |
+
|
178 |
+
### Pretraining
|
179 |
+
|
180 |
+
The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The
|
181 |
+
optimizer used is Adam with a learning rate of 4e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
|
182 |
+
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 30,000 steps and linear decay of the learning
|
183 |
+
rate after.
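
As a rough illustration of the schedule described above, the sketch below computes the learning rate at a given step. The peak rate, warmup length and total step count come from the text; the function name `learning_rate` and the choice to decay linearly to exactly zero at step 500K are assumptions made for this example.

```python
def learning_rate(step, peak=4e-4, warmup_steps=30_000, total_steps=500_000):
    """Linear warmup to `peak`, then linear decay (assumed to reach 0 at `total_steps`)."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak * remaining / (total_steps - warmup_steps)


for step in (0, 15_000, 30_000, 265_000, 500_000):
    print(step, learning_rate(step))
```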
|
184 |
+
|
185 |
+
## Evaluation results
|
186 |
+
|
187 |
+
When fine-tuned on downstream tasks, this model achieves the following results:
|
188 |
+
|
189 |
+
Glue test results:
|
190 |
+
|
191 |
+
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|
192 |
+
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|
193 |
+
| | 90.2 | 92.2 | 94.7 | 96.4 | 68.0 | 96.4 | 90.9 | 86.6 |
|
194 |
+
|
195 |
+
|
196 |
+
### BibTeX entry and citation info
|
197 |
+
|
198 |
+
```bibtex
|
199 |
+
@article{DBLP:journals/corr/abs-1907-11692,
|
200 |
+
author = {Yinhan Liu and
|
201 |
+
Myle Ott and
|
202 |
+
Naman Goyal and
|
203 |
+
Jingfei Du and
|
204 |
+
Mandar Joshi and
|
205 |
+
Danqi Chen and
|
206 |
+
Omer Levy and
|
207 |
+
Mike Lewis and
|
208 |
+
Luke Zettlemoyer and
|
209 |
+
Veselin Stoyanov},
|
210 |
+
title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
|
211 |
+
journal = {CoRR},
|
212 |
+
volume = {abs/1907.11692},
|
213 |
+
year = {2019},
|
214 |
+
url = {http://arxiv.org/abs/1907.11692},
|
215 |
+
archivePrefix = {arXiv},
|
216 |
+
eprint = {1907.11692},
|
217 |
+
timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
|
218 |
+
biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
|
219 |
+
bibsource = {dblp computer science bibliography, https://dblp.org}
|
220 |
+
}
|
221 |
+
```
|
222 |
+
|
223 |
+
<a href="https://huggingface.co/exbert/?model=roberta-base">
|
224 |
+
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
|
225 |
+
</a>
|
tests/cards/runwayml___stable-diffusion-v1-5.md
ADDED
@@ -0,0 +1,188 @@
1 |
+
# Stable Diffusion v1-5 Model Card
|
2 |
+
|
3 |
+
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
|
4 |
+
For more information about how Stable Diffusion functions, please have a look at [🤗's Stable Diffusion blog](https://huggingface.co/blog/stable_diffusion).
|
5 |
+
|
6 |
+
The **Stable-Diffusion-v1-5** checkpoint was initialized with the weights of the [Stable-Diffusion-v1-2](https://huggingface.co/CompVis/stable-diffusion-v1-2)
|
7 |
+
checkpoint and subsequently fine-tuned for 595k steps at resolution 512x512 on "laion-aesthetics v2 5+", with 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
|
8 |
+
|
9 |
+
You can use this both with the [🧨Diffusers library](https://github.com/huggingface/diffusers) and the [RunwayML GitHub repository](https://github.com/runwayml/stable-diffusion).
|
10 |
+
|
11 |
+
### Diffusers
|
12 |
+
```py
|
13 |
+
from diffusers import StableDiffusionPipeline
|
14 |
+
import torch
|
15 |
+
|
16 |
+
model_id = "runwayml/stable-diffusion-v1-5"
|
17 |
+
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
|
18 |
+
pipe = pipe.to("cuda")
|
19 |
+
|
20 |
+
prompt = "a photo of an astronaut riding a horse on mars"
|
21 |
+
image = pipe(prompt).images[0]
|
22 |
+
|
23 |
+
image.save("astronaut_rides_horse.png")
|
24 |
+
```
|
25 |
+
For more detailed instructions, use-cases and examples in JAX, follow the instructions [here](https://github.com/huggingface/diffusers#text-to-image-generation-with-stable-diffusion).
|
26 |
+
|
27 |
+
### Original GitHub Repository
|
28 |
+
|
29 |
+
1. Download the weights
|
30 |
+
- [v1-5-pruned-emaonly.ckpt](https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.ckpt) - 4.27GB, ema-only weight. uses less VRAM - suitable for inference
|
31 |
+
- [v1-5-pruned.ckpt](https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned.ckpt) - 7.7GB, ema+non-ema weights. uses more VRAM - suitable for fine-tuning
|
32 |
+
|
33 |
+
2. Follow instructions [here](https://github.com/runwayml/stable-diffusion).
|
34 |
+
|
35 |
+
## Model Details
|
36 |
+
- **Developed by:** Robin Rombach, Patrick Esser
|
37 |
+
- **Model type:** Diffusion-based text-to-image generation model
|
38 |
+
- **Language(s):** English
|
39 |
+
- **License:** [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based.
|
40 |
+
- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([CLIP ViT-L/14](https://arxiv.org/abs/2103.00020)) as suggested in the [Imagen paper](https://arxiv.org/abs/2205.11487).
|
41 |
+
- **Resources for more information:** [GitHub Repository](https://github.com/CompVis/stable-diffusion), [Paper](https://arxiv.org/abs/2112.10752).
|
42 |
+
- **Cite as:**
|
43 |
+
|
44 |
+
@InProceedings{Rombach_2022_CVPR,
|
45 |
+
author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
|
46 |
+
title = {High-Resolution Image Synthesis With Latent Diffusion Models},
|
47 |
+
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
|
48 |
+
month = {June},
|
49 |
+
year = {2022},
|
50 |
+
pages = {10684-10695}
|
51 |
+
}
|
52 |
+
|
53 |
+
# Uses
|
54 |
+
|
55 |
+
## Direct Use
|
56 |
+
The model is intended for research purposes only. Possible research areas and
|
57 |
+
tasks include
|
58 |
+
|
59 |
+
- Safe deployment of models which have the potential to generate harmful content.
|
60 |
+
- Probing and understanding the limitations and biases of generative models.
|
61 |
+
- Generation of artworks and use in design and other artistic processes.
|
62 |
+
- Applications in educational or creative tools.
|
63 |
+
- Research on generative models.
|
64 |
+
|
65 |
+
Excluded uses are described below.
|
66 |
+
|
67 |
+
### Misuse, Malicious Use, and Out-of-Scope Use
|
68 |
+
_Note: This section is taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini), but applies in the same way to Stable Diffusion v1_.
|
69 |
+
|
70 |
+
|
71 |
+
The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
|
72 |
+
|
73 |
+
#### Out-of-Scope Use
|
74 |
+
The model was not trained to produce factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
|
75 |
+
|
76 |
+
#### Misuse and Malicious Use
|
77 |
+
Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
|
78 |
+
|
79 |
+
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
|
80 |
+
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
|
81 |
+
- Impersonating individuals without their consent.
|
82 |
+
- Sexual content without consent of the people who might see it.
|
83 |
+
- Mis- and disinformation
|
84 |
+
- Representations of egregious violence and gore
|
85 |
+
- Sharing of copyrighted or licensed material in violation of its terms of use.
|
86 |
+
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
|
87 |
+
|
88 |
+
## Limitations and Bias
|
89 |
+
|
90 |
+
### Limitations
|
91 |
+
|
92 |
+
- The model does not achieve perfect photorealism
|
93 |
+
- The model cannot render legible text
|
94 |
+
- The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”
|
95 |
+
- Faces and people in general may not be generated properly.
|
96 |
+
- The model was trained mainly with English captions and will not work as well in other languages.
|
97 |
+
- The autoencoding part of the model is lossy
|
98 |
+
- The model was trained on a large-scale dataset
|
99 |
+
[LAION-5B](https://laion.ai/blog/laion-5b/) which contains adult material
|
100 |
+
and is not fit for product use without additional safety mechanisms and
|
101 |
+
considerations.
|
102 |
+
- No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data.
|
103 |
+
The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to possibly assist in the detection of memorized images.
|
104 |
+
|
105 |
+
### Bias
|
106 |
+
|
107 |
+
While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
|
108 |
+
Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
|
109 |
+
which consists of images that are primarily limited to English descriptions.
|
110 |
+
Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for.
|
111 |
+
This affects the overall output of the model, as white and western cultures are often set as the default. Further, the
|
112 |
+
ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.
|
113 |
+
|
114 |
+
### Safety Module
|
115 |
+
|
116 |
+
The intended use of this model is with the [Safety Checker](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) in Diffusers.
|
117 |
+
This checker works by checking model outputs against known hard-coded NSFW concepts.
|
118 |
+
The concepts are intentionally hidden to reduce the likelihood of reverse-engineering this filter.
|
119 |
+
Specifically, the checker compares the class probability of harmful concepts in the embedding space of the `CLIPTextModel` *after generation* of the images.
|
120 |
+
The concepts are passed into the model with the generated image and compared to a hand-engineered weight for each NSFW concept.
|
121 |
+
|
122 |
+
|
123 |
+
## Training
|
124 |
+
|
125 |
+
**Training Data**
|
126 |
+
The model developers used the following dataset for training the model:
|
127 |
+
|
128 |
+
- LAION-2B (en) and subsets thereof (see next section)
|
129 |
+
|
130 |
+
**Training Procedure**
|
131 |
+
Stable Diffusion v1-5 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,
|
132 |
+
|
133 |
+
- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
|
134 |
+
- Text prompts are encoded through a ViT-L/14 text-encoder.
|
135 |
+
- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
|
136 |
+
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
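
To make the shape bookkeeping in the first step concrete, here is a small sketch of the image-to-latent mapping with the downsampling factor of 8 and the 4 latent channels stated above. The helper name `latent_shape` and the requirement that both sides be divisible by the factor are assumptions of this example, not part of the Diffusers API.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Map an image of shape H x W x 3 to its latent shape H/f x W/f x 4."""
    if height % factor or width % factor:
        # Assumption for this sketch: only sizes divisible by the factor are accepted.
        raise ValueError("height and width must be multiples of the downsampling factor")
    return (height // factor, width // factor, latent_channels)


print(latent_shape(512, 512))  # (64, 64, 4)
```

At the 512x512 training resolution the diffusion model therefore works on a 64 x 64 x 4 latent instead of the full 512 x 512 x 3 image.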
|
137 |
+
|
138 |
+
Currently six Stable Diffusion checkpoints are provided, which were trained as follows.
|
139 |
+
- [`stable-diffusion-v1-1`](https://huggingface.co/CompVis/stable-diffusion-v1-1): 237,000 steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
|
140 |
+
194,000 steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
|
141 |
+
- [`stable-diffusion-v1-2`](https://huggingface.co/CompVis/stable-diffusion-v1-2): Resumed from `stable-diffusion-v1-1`.
|
142 |
+
515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
|
143 |
+
filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
|
144 |
+
- [`stable-diffusion-v1-3`](https://huggingface.co/CompVis/stable-diffusion-v1-3): Resumed from `stable-diffusion-v1-2` - 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" and 10 % dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
|
145 |
+
- [`stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) Resumed from `stable-diffusion-v1-2` - 225,000 steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
|
146 |
+
- [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) Resumed from `stable-diffusion-v1-2` - 595,000 steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
|
147 |
+
- [`stable-diffusion-inpainting`](https://huggingface.co/runwayml/stable-diffusion-inpainting) Resumed from `stable-diffusion-v1-5` - then 440,000 steps of inpainting training at resolution 512x512 on “laion-aesthetics v2 5+” and 10% dropping of the text-conditioning. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and in 25% of cases mask everything.
|
148 |
+
|
149 |
+
- **Hardware:** 32 x 8 x A100 GPUs
|
150 |
+
- **Optimizer:** AdamW
|
151 |
+
- **Gradient Accumulations**: 2
|
152 |
+
- **Batch:** 32 x 8 x 2 x 4 = 2048
|
153 |
+
- **Learning rate:** warmup to 0.0001 for 10,000 steps and then kept constant
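
The batch figure above is a plain product, which the following sketch reproduces. Reading the four factors as 32 nodes, 8 GPUs per node, 2 gradient accumulation steps and a per-GPU batch of 4 is an interpretation of the list, and the function name is hypothetical.

```python
def effective_batch_size(nodes, gpus_per_node, grad_accum_steps, per_gpu_batch):
    """Number of samples contributing to one optimizer step."""
    return nodes * gpus_per_node * grad_accum_steps * per_gpu_batch


print(effective_batch_size(nodes=32, gpus_per_node=8, grad_accum_steps=2, per_gpu_batch=4))  # 2048
```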
|
154 |
+
|
155 |
+
## Evaluation Results
|
156 |
+
Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
|
157 |
+
5.0, 6.0, 7.0, 8.0) and 50 PNDM/PLMS sampling
|
158 |
+
steps show the relative improvements of the checkpoints:
|
159 |
+
|
160 |
+
![pareto](https://huggingface.co/CompVis/stable-diffusion/resolve/main/v1-1-to-v1-5.png)
|
161 |
+
|
162 |
+
Evaluated using 50 PLMS steps and 10000 random prompts from the COCO2017 validation set, evaluated at 512x512 resolution. Not optimized for FID scores.
|
163 |
+
## Environmental Impact
|
164 |
+
|
165 |
+
**Stable Diffusion v1** **Estimated Emissions**
|
166 |
+
Based on that information, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.
|
167 |
+
|
168 |
+
- **Hardware Type:** A100 PCIe 40GB
|
169 |
+
- **Hours used:** 150000
|
170 |
+
- **Cloud Provider:** AWS
|
171 |
+
- **Compute Region:** US-east
|
172 |
+
- **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.
|
173 |
+
|
174 |
+
|
175 |
+
## Citation
|
176 |
+
|
177 |
+
```bibtex
|
178 |
+
@InProceedings{Rombach_2022_CVPR,
|
179 |
+
author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
|
180 |
+
title = {High-Resolution Image Synthesis With Latent Diffusion Models},
|
181 |
+
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
|
182 |
+
month = {June},
|
183 |
+
year = {2022},
|
184 |
+
pages = {10684-10695}
|
185 |
+
}
|
186 |
+
```
|
187 |
+
|
188 |
+
*This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*
|
tests/cards/sentence-transformers___all-MiniLM-L6-v2.md
ADDED
@@ -0,0 +1,142 @@
1 |
+
# all-MiniLM-L6-v2
|
2 |
+
This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
|
3 |
+
|
4 |
+
## Usage (Sentence-Transformers)
|
5 |
+
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
|
6 |
+
|
7 |
+
```
|
8 |
+
pip install -U sentence-transformers
|
9 |
+
```
|
10 |
+
|
11 |
+
Then you can use the model like this:
|
12 |
+
```python
|
13 |
+
from sentence_transformers import SentenceTransformer
|
14 |
+
sentences = ["This is an example sentence", "Each sentence is converted"]
|
15 |
+
|
16 |
+
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
|
17 |
+
embeddings = model.encode(sentences)
|
18 |
+
print(embeddings)
|
19 |
+
```
|
20 |
+
|
21 |
+
## Usage (HuggingFace Transformers)
|
22 |
+
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
|
23 |
+
|
24 |
+
```python
|
25 |
+
from transformers import AutoTokenizer, AutoModel
|
26 |
+
import torch
|
27 |
+
import torch.nn.functional as F
|
28 |
+
|
29 |
+
#Mean Pooling - Take attention mask into account for correct averaging
|
30 |
+
def mean_pooling(model_output, attention_mask):
|
31 |
+
token_embeddings = model_output[0] #First element of model_output contains all token embeddings
|
32 |
+
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
|
33 |
+
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
|
34 |
+
|
35 |
+
|
36 |
+
# Sentences we want sentence embeddings for
|
37 |
+
sentences = ['This is an example sentence', 'Each sentence is converted']
|
38 |
+
|
39 |
+
# Load model from HuggingFace Hub
|
40 |
+
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
|
41 |
+
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
|
42 |
+
|
43 |
+
# Tokenize sentences
|
44 |
+
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
|
45 |
+
|
46 |
+
# Compute token embeddings
|
47 |
+
with torch.no_grad():
|
48 |
+
model_output = model(**encoded_input)
|
49 |
+
|
50 |
+
# Perform pooling
|
51 |
+
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
|
52 |
+
|
53 |
+
# Normalize embeddings
|
54 |
+
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
|
55 |
+
|
56 |
+
print("Sentence embeddings:")
|
57 |
+
print(sentence_embeddings)
|
58 |
+
```
|
59 |
+
|
60 |
+
## Evaluation Results
|
61 |
+
|
62 |
+
For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/all-MiniLM-L6-v2)
|
63 |
+
|
64 |
+
------
|
65 |
+
|
66 |
+
## Background
|
67 |
+
|
68 |
+
The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
|
69 |
+
contrastive learning objective. We used the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model and fine-tuned it on a
|
70 |
+
1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which one, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
|
71 |
+
|
72 |
+
We developed this model during the
|
73 |
+
[Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
|
74 |
+
organized by Hugging Face. We developed this model as part of the project:
|
75 |
+
[Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.
|
76 |
+
|
77 |
+
## Intended uses
|
78 |
+
|
79 |
+
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
|
80 |
+
the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
|
81 |
+
|
82 |
+
By default, input text longer than 256 word pieces is truncated.
|
83 |
+
|
84 |
+
|
85 |
+
## Training procedure
|
86 |
+
|
87 |
+
### Pre-training
|
88 |
+
|
89 |
+
We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to the model card for more detailed information about the pre-training procedure.
|
90 |
+
|
91 |
+
### Fine-tuning
|
92 |
+
|
93 |
+
We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch.
|
94 |
+
We then apply the cross-entropy loss, treating the true pairs as the correct labels.
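
A minimal pure-Python sketch of this objective follows, assuming in-batch negatives: row `i` of the similarity matrix is scored against every candidate in the batch, and the true partner at column `i` is the correct class. The function names and the similarity `scale` of 20 are assumptions for illustration; the actual values live in `train_script.py`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over the batch; the true pair for row i is column i."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp over all candidates in the batch
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(in_batch_contrastive_loss(anchors, aligned))   # close to 0
print(in_batch_contrastive_loss(anchors, shuffled))  # close to 20
```

The loss is small when each sentence is most similar to its own partner and grows when another sentence in the batch scores higher, which is what pushes true pairs together in the embedding space.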
|
95 |
+
|
96 |
+
#### Hyper parameters
|
97 |
+
|
98 |
+
We trained our model on a TPU v3-8. We trained the model for 100k steps using a batch size of 1024 (128 per TPU core).
|
99 |
+
We use a learning rate warm up of 500. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
|
100 |
+
a 2e-5 learning rate. The full training script is accessible in this current repository: `train_script.py`.
|
101 |
+
|
102 |
+
#### Training data
|
103 |
+
|
104 |
+
We use the concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion.
|
105 |
+
We sampled each dataset according to a weighted probability; the configuration is detailed in the `data_config.json` file.
|
106 |
+
|
107 |
+
|
108 |
+
| Dataset | Paper | Number of training tuples |
|
109 |
+
|--------------------------------------------------------|:----------------------------------------:|:--------------------------:|
|
110 |
+
| [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit) | [paper](https://arxiv.org/abs/1904.06472) | 726,484,430 |
|
111 |
+
| [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Abstracts) | [paper](https://aclanthology.org/2020.acl-main.447/) | 116,288,806 |
|
112 |
+
| [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs | [paper](https://doi.org/10.1145/2623330.2623677) | 77,427,422 |
|
113 |
+
| [PAQ](https://github.com/facebookresearch/PAQ) (Question, Answer) pairs | [paper](https://arxiv.org/abs/2102.07033) | 64,371,441 |
|
114 |
+
| [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Titles) | [paper](https://aclanthology.org/2020.acl-main.447/) | 52,603,982 |
|
115 |
+
| [S2ORC](https://github.com/allenai/s2orc) (Title, Abstract) | [paper](https://aclanthology.org/2020.acl-main.447/) | 41,769,185 |
|
116 |
+
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs | - | 25,316,456 |
|
117 |
+
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs | - | 21,396,559 |
|
118 |
+
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs | - | 21,396,559 |
|
119 |
+
| [MS MARCO](https://microsoft.github.io/msmarco/) triplets | [paper](https://doi.org/10.1145/3404835.3462804) | 9,144,553 |
|
120 |
+
| [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) | [paper](https://arxiv.org/pdf/2104.08727.pdf) | 3,012,496 |
|
121 |
+
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 1,198,260 |
|
122 |
+
| [Code Search](https://huggingface.co/datasets/code_search_net) | - | 1,151,414 |
|
123 |
+
| [COCO](https://cocodataset.org/#home) Image captions | [paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48) | 828,395 |
|
124 |
+
| [SPECTER](https://github.com/allenai/specter) citation triplets | [paper](https://doi.org/10.18653/v1/2020.acl-main.207) | 684,100 |
|
125 |
+
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 681,164 |
|
126 |
+
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 659,896 |
|
127 |
+
| [SearchQA](https://huggingface.co/datasets/search_qa) | [paper](https://arxiv.org/abs/1704.05179) | 582,261 |
|
128 |
+
| [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
|
129 |
+
| [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
|
130 |
+
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | | 304,525 |
|
131 |
+
| AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
|
132 |
+
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | | 250,519 |
|
133 |
+
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | | 250,460 |
|
134 |
+
| [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
|
135 |
+
| [Wikihow](https://github.com/pvl/wikihow_pairs_dataset) | [paper](https://arxiv.org/abs/1810.09305) | 128,542 |
|
136 |
+
| [Altlex](https://github.com/chridey/altlex/) | [paper](https://aclanthology.org/P16-1135.pdf) | 112,696 |
|
137 |
+
| [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | - | 103,663 |
|
138 |
+
| [Simple Wikipedia](https://cs.pomona.edu/~dkauchak/simplification/) | [paper](https://www.aclweb.org/anthology/P11-2117/) | 102,225 |
|
139 |
+
| [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
|
140 |
+
| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
|
141 |
+
| [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
|
142 |
+
| **Total** | | **1,170,060,424** |
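
The weighted sampling over datasets mentioned before the table can be sketched with the standard library. The weights and dataset keys below are hypothetical placeholders; the real values are the ones in `data_config.json`, and the helper name `sample_dataset_names` is an assumption for illustration.

```python
import random

# Hypothetical weights; the actual configuration lives in data_config.json.
dataset_weights = {"reddit": 5, "s2orc": 3, "wikianswers": 2}


def sample_dataset_names(weights, n, seed=0):
    """Draw n dataset names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


draws = sample_dataset_names(dataset_weights, 10_000)
counts = {name: draws.count(name) for name in dataset_weights}
```

Each draw picks which dataset the next training batch comes from, so larger weights translate into a larger share of training steps.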
|
tests/cards/t5-base.md
ADDED
@@ -0,0 +1,175 @@
1 |
+
# Model Card for T5 Base
|
2 |
+
|
3 |
+
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
|
4 |
+
|
5 |
+
# Table of Contents
|
6 |
+
|
7 |
+
1. [Model Details](#model-details)
|
8 |
+
2. [Uses](#uses)
|
9 |
+
3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
|
10 |
+
4. [Training Details](#training-details)
|
11 |
+
5. [Evaluation](#evaluation)
|
12 |
+
6. [Environmental Impact](#environmental-impact)
|
13 |
+
7. [Citation](#citation)
|
14 |
+
8. [Model Card Authors](#model-card-authors)
|
15 |
+
9. [How To Get Started With the Model](#how-to-get-started-with-the-model)
|
16 |
+
|
17 |
+
# Model Details
|
18 |
+
|
19 |
+
## Model Description
|
20 |
+
|
21 |
+
The developers of the Text-To-Text Transfer Transformer (T5) [write](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html):
|
22 |
+
|
23 |
+
> With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task.
|
24 |
+
|
25 |
+
T5-Base is the checkpoint with 220 million parameters.
|
26 |
+
|
27 |
+
- **Developed by:** Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. See [associated paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) and [GitHub repo](https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints)
|
28 |
+
- **Model type:** Language model
|
29 |
+
- **Language(s) (NLP):** English, French, Romanian, German
|
30 |
+
- **License:** Apache 2.0
|
31 |
+
- **Related Models:** [All T5 Checkpoints](https://huggingface.co/models?search=t5)
|
32 |
+
- **Resources for more information:**
|
33 |
+
- [Research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf)
|
34 |
+
- [Google's T5 Blog Post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
|
35 |
+
- [GitHub Repo](https://github.com/google-research/text-to-text-transfer-transformer)
|
36 |
+
- [Hugging Face T5 Docs](https://huggingface.co/docs/transformers/model_doc/t5)
|
37 |
+
|
38 |
+
# Uses
|
39 |
+
|
40 |
+
## Direct Use and Downstream Use
|
41 |
+
|
42 |
+
The developers write in a [blog post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html):
|
43 |
+
|
44 |
+
> Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). We can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.
|
45 |
+
|
46 |
+
See the [blog post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) for further details.
|
47 |
+
|
48 |
+
## Out-of-Scope Use
|
49 |
+
|
50 |
+
More information needed.
|
51 |
+
|
52 |
+
# Bias, Risks, and Limitations
|
53 |
+
|
54 |
+
More information needed.
|
55 |
+
|
56 |
+
## Recommendations
|
57 |
+
|
58 |
+
More information needed.
|
59 |
+
|
60 |
+
# Training Details
|
61 |
+
|
62 |
+
## Training Data
|
63 |
+
|
64 |
+
The model is pre-trained on the [Colossal Clean Crawled Corpus (C4)](https://www.tensorflow.org/datasets/catalog/c4), which was developed and released in the context of the same [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) as T5.
|
65 |
+
|
66 |
+
The model was pre-trained on a **multi-task mixture of unsupervised (1.) and supervised tasks (2.)**.
|
67 |
+
The following datasets were used for (1.) and (2.):
|
68 |
+
|
69 |
+
1. **Datasets used for Unsupervised denoising objective**:
|
70 |
+
|
71 |
+
- [C4](https://huggingface.co/datasets/c4)
|
72 |
+
- [Wiki-DPR](https://huggingface.co/datasets/wiki_dpr)
|
73 |
+
|
74 |
+
|
75 |
+
2. **Datasets used for Supervised text-to-text language modeling objective**
|
76 |
+
|
77 |
+
- Sentence acceptability judgment
|
78 |
+
- CoLA [Warstadt et al., 2018](https://arxiv.org/abs/1805.12471)
|
79 |
+
- Sentiment analysis
|
80 |
+
- SST-2 [Socher et al., 2013](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf)
|
81 |
+
- Paraphrasing/sentence similarity
|
82 |
+
- MRPC [Dolan and Brockett, 2005](https://aclanthology.org/I05-5002)
|
83 |
+
- STS-B [Cer et al., 2017](https://arxiv.org/abs/1708.00055)
|
84 |
+
- QQP [Iyer et al., 2017](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs)
|
85 |
+
- Natural language inference
|
86 |
+
- MNLI [Williams et al., 2017](https://arxiv.org/abs/1704.05426)
|
87 |
+
- QNLI [Rajpurkar et al., 2016](https://arxiv.org/abs/1606.05250)
|
88 |
+
- RTE [Dagan et al., 2005](https://link.springer.com/chapter/10.1007/11736790_9)
|
89 |
+
- CB [De Marneffe et al., 2019](https://semanticsarchive.net/Archive/Tg3ZGI2M/Marneffe.pdf)
|
90 |
+
- Sentence completion
|
91 |
+
- COPA [Roemmele et al., 2011](https://www.researchgate.net/publication/221251392_Choice_of_Plausible_Alternatives_An_Evaluation_of_Commonsense_Causal_Reasoning)
|
92 |
+
- Word sense disambiguation
|
93 |
+
- WIC [Pilehvar and Camacho-Collados, 2018](https://arxiv.org/abs/1808.09121)
|
94 |
+
- Question answering
|
95 |
+
- MultiRC [Khashabi et al., 2018](https://aclanthology.org/N18-1023)
|
96 |
+
- ReCoRD [Zhang et al., 2018](https://arxiv.org/abs/1810.12885)
|
97 |
+
- BoolQ [Clark et al., 2019](https://arxiv.org/abs/1905.10044)
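
The unsupervised denoising objective listed under (1.) is described in the T5 paper as span corruption: spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the text it replaced. A minimal sketch follows, assuming whitespace tokenization and hand-picked spans rather than random sampling. The helper `corrupt_spans` is hypothetical; the `<extra_id_N>` sentinel naming matches the Hugging Face T5 tokenizer.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel token.

    `spans` must be sorted, non-overlapping, end-exclusive index pairs.
    Returns the corrupted input and the matching target as strings.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel marks the end of the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
```

Here `corrupted` is the encoder input and `target` is what the decoder learns to produce.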
|
98 |
+
|
99 |
+
## Training Procedure
|
100 |
+
|
101 |
+
In their [abstract](https://jmlr.org/papers/volume21/20-074/20-074.pdf), the model developers write:
|
102 |
+
|
103 |
+
> In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
|
104 |
+
|
105 |
+
The framework introduced, the T5 framework, involves a training procedure that brings together the approaches studied in the paper. See the [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) for further details.
|
106 |
+
|
107 |
+
# Evaluation
|
108 |
+
|
109 |
+
## Testing Data, Factors & Metrics
|
110 |
+
|
111 |
+
The developers evaluated the model on 24 tasks; see the [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) for full details.
|
112 |
+
|
113 |
+
## Results
|
114 |
+
|
115 |
+
For full results for T5-Base, see the [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf), Table 14.
|
116 |
+
|
117 |
+
# Environmental Impact
|
118 |
+
|
119 |
+
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
|
120 |
+
|
121 |
+
- **Hardware Type:** Google Cloud TPU Pods
|
122 |
+
- **Hours used:** More information needed
|
123 |
+
- **Cloud Provider:** GCP
|
124 |
+
- **Compute Region:** More information needed
|
125 |
+
- **Carbon Emitted:** More information needed
|
126 |
+
|
127 |
+
# Citation
|
128 |
+
|
129 |
+
**BibTeX:**
|
130 |
+
|
131 |
+
```bibtex
|
132 |
+
@article{2020t5,
|
133 |
+
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
|
134 |
+
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
|
135 |
+
journal = {Journal of Machine Learning Research},
|
136 |
+
year = {2020},
|
137 |
+
volume = {21},
|
138 |
+
number = {140},
|
139 |
+
pages = {1-67},
|
140 |
+
url = {http://jmlr.org/papers/v21/20-074.html}
|
141 |
+
}
|
142 |
+
```
|
143 |
+
|
144 |
+
**APA:**
|
145 |
+
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), 1-67.
|
146 |
+
|
147 |
+
# Model Card Authors
|
148 |
+
|
149 |
+
This model card was written by the team at Hugging Face.
|
150 |
+
|
151 |
+
# How to Get Started with the Model
|
152 |
+
|
153 |
+
Use the code below to get started with the model.
|
154 |
+
|
155 |
+
<details>
|
156 |
+
<summary> Click to expand </summary>
|
157 |
+
|
158 |
+
```python
|
159 |
+
from transformers import T5Tokenizer, T5Model
|
160 |
+
|
161 |
+
tokenizer = T5Tokenizer.from_pretrained("t5-base")
|
162 |
+
model = T5Model.from_pretrained("t5-base")
|
163 |
+
|
164 |
+
input_ids = tokenizer(
|
165 |
+
"Studies have been shown that owning a dog is good for you", return_tensors="pt"
|
166 |
+
).input_ids # Batch size 1
|
167 |
+
decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
|
168 |
+
|
169 |
+
# forward pass
|
170 |
+
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
|
171 |
+
last_hidden_states = outputs.last_hidden_state
|
172 |
+
```
|
173 |
+
|
174 |
+
See the [Hugging Face T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model) docs and a [Colab Notebook](https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/main/notebooks/t5-trivia.ipynb) created by the model developers for more examples.
|
175 |
+
</details>
|
tests/cards/t5-small.md
ADDED
@@ -0,0 +1,175 @@
1 |
+
# Model Card for T5 Small
|
2 |
+
|
3 |
+
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
|
4 |
+
|
5 |
+
# Table of Contents
|
6 |
+
|
7 |
+
1. [Model Details](#model-details)
|
8 |
+
2. [Uses](#uses)
|
9 |
+
3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
|
10 |
+
4. [Training Details](#training-details)
|
11 |
+
5. [Evaluation](#evaluation)
|
12 |
+
6. [Environmental Impact](#environmental-impact)
|
13 |
+
7. [Citation](#citation)
|
14 |
+
8. [Model Card Authors](#model-card-authors)
|
15 |
+
9. [How To Get Started With the Model](#how-to-get-started-with-the-model)
|
16 |
+
|
17 |
+
# Model Details
|
18 |
+
|
19 |
+
## Model Description
|
20 |
+
|
21 |
+
The developers of the Text-To-Text Transfer Transformer (T5) [write](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html):
|
22 |
+
|
23 |
+
> With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task.
|
24 |
+
|
25 |
+
T5-Small is the checkpoint with 60 million parameters.
|
26 |
+
|
27 |
+
- **Developed by:** Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. See [associated paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) and [GitHub repo](https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints)
|
28 |
+
- **Model type:** Language model
|
29 |
+
- **Language(s) (NLP):** English, French, Romanian, German
|
30 |
+
- **License:** Apache 2.0
|
31 |
+
- **Related Models:** [All T5 Checkpoints](https://huggingface.co/models?search=t5)
|
32 |
+
- **Resources for more information:**
|
33 |
+
- [Research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf)
|
34 |
+
- [Google's T5 Blog Post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
|
35 |
+
- [GitHub Repo](https://github.com/google-research/text-to-text-transfer-transformer)
|
36 |
+
- [Hugging Face T5 Docs](https://huggingface.co/docs/transformers/model_doc/t5)
|
37 |
+
|
38 |
+
# Uses
|
39 |
+
|
40 |
+
## Direct Use and Downstream Use
|
41 |
+
|
42 |
+
The developers write in a [blog post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html):
|
43 |
+
|
44 |
+
> Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). We can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.
|
45 |
+
|
46 |
+
See the [blog post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) for further details.
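
To make the text-to-text framing quoted above concrete, the sketch below renders a few tasks as plain input strings and turns a regression score into a string target. The task keys and helper names are hypothetical. The prefixes and the rounding to steps of 0.2 follow the conventions described in the T5 paper as best understood here, so treat them as assumptions to check against the paper.

```python
def to_text_to_text(task, **fields):
    """Render one example as a single input string with a task prefix."""
    if task == "translate_en_de":
        return f"translate English to German: {fields['text']}"
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "stsb":
        return (
            f"stsb sentence1: {fields['sentence1']} "
            f"sentence2: {fields['sentence2']}"
        )
    raise ValueError(f"unknown task: {task}")


def regression_target(score, step=0.2):
    """Round a similarity score to the nearest step and return it as text."""
    return f"{round(score / step) * step:.1f}"


example_input = to_text_to_text(
    "stsb", sentence1="A man is singing.", sentence2="A man sings."
)
example_target = regression_target(3.77)
```

Because both inputs and targets are strings, the same model and loss can be reused across translation, summarization, classification, and regression.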
|
47 |
+
|
48 |
+
## Out-of-Scope Use
|
49 |
+
|
50 |
+
More information needed.
|
51 |
+
|
52 |
+
# Bias, Risks, and Limitations
|
53 |
+
|
54 |
+
More information needed.
|
55 |
+
|
56 |
+
## Recommendations
|
57 |
+
|
58 |
+
More information needed.
|
59 |
+
|
60 |
+
# Training Details
|
61 |
+
|
62 |
+
## Training Data
|
63 |
+
|
64 |
+
The model is pre-trained on the [Colossal Clean Crawled Corpus (C4)](https://www.tensorflow.org/datasets/catalog/c4), which was developed and released in the context of the same [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) as T5.
|
65 |
+
|
66 |
+
The model was pre-trained on a **multi-task mixture of unsupervised (1.) and supervised tasks (2.)**.
|
67 |
+
The following datasets were used for (1.) and (2.):
|
68 |
+
|
69 |
+
1. **Datasets used for Unsupervised denoising objective**:
|
70 |
+
|
71 |
+
- [C4](https://huggingface.co/datasets/c4)
|
72 |
+
- [Wiki-DPR](https://huggingface.co/datasets/wiki_dpr)
|
73 |
+
|
74 |
+
|
75 |
+
2. **Datasets used for Supervised text-to-text language modeling objective**
|
76 |
+
|
77 |
+
- Sentence acceptability judgment
|
78 |
+
- CoLA [Warstadt et al., 2018](https://arxiv.org/abs/1805.12471)
|
79 |
+
- Sentiment analysis
|
80 |
+
- SST-2 [Socher et al., 2013](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf)
|
81 |
+
- Paraphrasing/sentence similarity
|
82 |
+
- MRPC [Dolan and Brockett, 2005](https://aclanthology.org/I05-5002)
|
83 |
+
- STS-B [Cer et al., 2017](https://arxiv.org/abs/1708.00055)
|
84 |
+
- QQP [Iyer et al., 2017](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs)
|
85 |
+
- Natural language inference
|
86 |
+
- MNLI [Williams et al., 2017](https://arxiv.org/abs/1704.05426)
|
87 |
+
- QNLI [Rajpurkar et al., 2016](https://arxiv.org/abs/1606.05250)
|
88 |
+
- RTE [Dagan et al., 2005](https://link.springer.com/chapter/10.1007/11736790_9)
|
89 |
+
- CB [De Marneffe et al., 2019](https://semanticsarchive.net/Archive/Tg3ZGI2M/Marneffe.pdf)
|
90 |
+
- Sentence completion
|
91 |
+
- COPA [Roemmele et al., 2011](https://www.researchgate.net/publication/221251392_Choice_of_Plausible_Alternatives_An_Evaluation_of_Commonsense_Causal_Reasoning)
|
92 |
+
- Word sense disambiguation
|
93 |
+
- WIC [Pilehvar and Camacho-Collados, 2018](https://arxiv.org/abs/1808.09121)
|
94 |
+
- Question answering
|
95 |
+
- MultiRC [Khashabi et al., 2018](https://aclanthology.org/N18-1023)
|
96 |
+
- ReCoRD [Zhang et al., 2018](https://arxiv.org/abs/1810.12885)
|
97 |
+
- BoolQ [Clark et al., 2019](https://arxiv.org/abs/1905.10044)
|
98 |
+
|
99 |
+
## Training Procedure
|
100 |
+
|
101 |
+
In their [abstract](https://jmlr.org/papers/volume21/20-074/20-074.pdf), the model developers write:
|
102 |
+
|
103 |
+
> In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
|
104 |
+
|
105 |
+
The framework introduced, the T5 framework, involves a training procedure that brings together the approaches studied in the paper. See the [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) for further details.
|
106 |
+
|
107 |
+
# Evaluation
|
108 |
+
|
109 |
+
## Testing Data, Factors & Metrics
|
110 |
+
|
111 |
+
The developers evaluated the model on 24 tasks; see the [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) for full details.
|
112 |
+
|
113 |
+
## Results
|
114 |
+
|
115 |
+
For full results for T5-small, see the [research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf), Table 14.
|
116 |
+
|
117 |
+
# Environmental Impact
|
118 |
+
|
119 |
+
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
|
120 |
+
|
121 |
+
- **Hardware Type:** Google Cloud TPU Pods
|
122 |
+
- **Hours used:** More information needed
|
123 |
+
- **Cloud Provider:** GCP
|
124 |
+
- **Compute Region:** More information needed
|
125 |
+
- **Carbon Emitted:** More information needed
|
126 |
+
|
127 |
+
# Citation
|
128 |
+
|
129 |
+
**BibTeX:**
|
130 |
+
|
131 |
+
```bibtex
|
132 |
+
@article{2020t5,
|
133 |
+
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
|
134 |
+
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
|
135 |
+
journal = {Journal of Machine Learning Research},
|
136 |
+
year = {2020},
|
137 |
+
volume = {21},
|
138 |
+
number = {140},
|
139 |
+
pages = {1-67},
|
140 |
+
url = {http://jmlr.org/papers/v21/20-074.html}
|
141 |
+
}
|
142 |
+
```
|
143 |
+
|
144 |
+
**APA:**
|
145 |
+
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), 1-67.
|
146 |
+
|
147 |
+
# Model Card Authors
|
148 |
+
|
149 |
+
This model card was written by the team at Hugging Face.
|
150 |
+
|
151 |
+
# How to Get Started with the Model
|
152 |
+
|
153 |
+
Use the code below to get started with the model.
|
154 |
+
|
155 |
+
<details>
|
156 |
+
<summary> Click to expand </summary>
|
157 |
+
|
158 |
+
```python
|
159 |
+
from transformers import T5Tokenizer, T5Model
|
160 |
+
|
161 |
+
tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
162 |
+
model = T5Model.from_pretrained("t5-small")
|
163 |
+
|
164 |
+
input_ids = tokenizer(
|
165 |
+
"Studies have been shown that owning a dog is good for you", return_tensors="pt"
|
166 |
+
).input_ids # Batch size 1
|
167 |
+
decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
|
168 |
+
|
169 |
+
# forward pass
|
170 |
+
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
|
171 |
+
last_hidden_states = outputs.last_hidden_state
|
172 |
+
```
|
173 |
+
|
174 |
+
See the [Hugging Face T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model) docs and a [Colab Notebook](https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/main/notebooks/t5-trivia.ipynb) created by the model developers for more examples.
|
175 |
+
</details>
|
tests/cards/xlm-roberta-base.md
ADDED
@@ -0,0 +1,99 @@
1 |
+
# XLM-RoBERTa (base-sized model)
|
2 |
+
|
3 |
+
XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al. and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
|
4 |
+
|
5 |
+
Disclaimer: The team releasing XLM-RoBERTa did not write a model card for this model so this model card has been written by the Hugging Face team.
|
6 |
+
|
7 |
+
## Model description
|
8 |
+
|
9 |
+
XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
|
10 |
+
|
11 |
+
RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
|
12 |
+
|
13 |
+
More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
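
The masking step described above can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: it works on whitespace-split words rather than subword tokens, always substitutes the mask token, and fixes the number of masked positions instead of sampling each position independently. The helper name `mask_tokens` is hypothetical.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Mask roughly mask_prob of the tokens.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


tokens = (
    "the quick brown fox jumps over the lazy dog "
    "near the quiet river bank today"
).split()
masked, labels = mask_tokens(tokens)
```

The model sees `masked` in full, with context on both sides of every gap, and is trained to recover the entries of `labels`.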
|
14 |
+
|
15 |
+
This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa model as inputs.
|
16 |
+
|
17 |
+
## Intended uses & limitations
|
18 |
+
|
19 |
+
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=xlm-roberta) to look for fine-tuned versions on a task that interests you.
|
20 |
+
|
21 |
+
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
|
22 |
+
|
23 |
+
## Usage
|
24 |
+
|
25 |
+
You can use this model directly with a pipeline for masked language modeling:
|
26 |
+
|
27 |
+
```python
|
28 |
+
>>> from transformers import pipeline
|
29 |
+
>>> unmasker = pipeline('fill-mask', model='xlm-roberta-base')
|
30 |
+
>>> unmasker("Hello I'm a <mask> model.")
|
31 |
+
|
32 |
+
[{'score': 0.10563907772302628,
|
33 |
+
'sequence': "Hello I'm a fashion model.",
|
34 |
+
'token': 54543,
|
35 |
+
'token_str': 'fashion'},
|
36 |
+
{'score': 0.08015287667512894,
|
37 |
+
'sequence': "Hello I'm a new model.",
|
38 |
+
'token': 3525,
|
39 |
+
'token_str': 'new'},
|
40 |
+
{'score': 0.033413201570510864,
|
41 |
+
'sequence': "Hello I'm a model model.",
|
42 |
+
'token': 3299,
|
43 |
+
'token_str': 'model'},
|
44 |
+
{'score': 0.030217764899134636,
|
45 |
+
'sequence': "Hello I'm a French model.",
|
46 |
+
'token': 92265,
|
47 |
+
'token_str': 'French'},
|
48 |
+
{'score': 0.026436051353812218,
|
49 |
+
'sequence': "Hello I'm a sexy model.",
|
50 |
+
'token': 17473,
|
51 |
+
'token_str': 'sexy'}]
|
52 |
+
```
|
53 |
+
|
54 |
+
Here is how to use this model to get the features of a given text in PyTorch:
|
55 |
+
|
56 |
+
```python
|
57 |
+
from transformers import AutoTokenizer, AutoModelForMaskedLM
|
58 |
+
|
59 |
+
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
|
60 |
+
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
|
61 |
+
|
62 |
+
# prepare input
|
63 |
+
text = "Replace me by any text you'd like."
|
64 |
+
encoded_input = tokenizer(text, return_tensors='pt')
|
65 |
+
|
66 |
+
# forward pass
|
67 |
+
output = model(**encoded_input)
|
68 |
+
```
|
69 |
+
|
70 |
+
### BibTeX entry and citation info
|
71 |
+
|
72 |
+
```bibtex
|
73 |
+
@article{DBLP:journals/corr/abs-1911-02116,
|
74 |
+
author = {Alexis Conneau and
|
75 |
+
Kartikay Khandelwal and
|
76 |
+
Naman Goyal and
|
77 |
+
Vishrav Chaudhary and
|
78 |
+
Guillaume Wenzek and
|
79 |
+
Francisco Guzm{\'{a}}n and
|
80 |
+
Edouard Grave and
|
81 |
+
Myle Ott and
|
82 |
+
Luke Zettlemoyer and
|
83 |
+
Veselin Stoyanov},
|
84 |
+
title = {Unsupervised Cross-lingual Representation Learning at Scale},
|
85 |
+
journal = {CoRR},
|
86 |
+
volume = {abs/1911.02116},
|
87 |
+
year = {2019},
|
88 |
+
url = {http://arxiv.org/abs/1911.02116},
|
89 |
+
eprinttype = {arXiv},
|
90 |
+
eprint = {1911.02116},
|
91 |
+
timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
|
92 |
+
biburl = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
|
93 |
+
bibsource = {dblp computer science bibliography, https://dblp.org}
|
94 |
+
}
|
95 |
+
```
|
96 |
+
|
97 |
+
<a href="https://huggingface.co/exbert/?model=xlm-roberta-base">
|
98 |
+
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
|
99 |
+
</a>
|
tests/cards/xlm-roberta-large.md
ADDED
@@ -0,0 +1,99 @@
+# XLM-RoBERTa (large-sized model)
+
+XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al. and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
+
+Disclaimer: The team releasing XLM-RoBERTa did not write a model card for this model so this model card has been written by the Hugging Face team.
+
+## Model description
+
+XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
+
+RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
+
+More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
+
+This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa model as inputs.
+
+## Intended uses & limitations
+
+You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=xlm-roberta) to look for fine-tuned versions on a task that interests you.
+
+Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
+
+## Usage
+
+You can use this model directly with a pipeline for masked language modeling:
+
+```python
+>>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='xlm-roberta-large')
+>>> unmasker("Hello I'm a <mask> model.")
+
+[{'score': 0.10563907772302628,
+  'sequence': "Hello I'm a fashion model.",
+  'token': 54543,
+  'token_str': 'fashion'},
+ {'score': 0.08015287667512894,
+  'sequence': "Hello I'm a new model.",
+  'token': 3525,
+  'token_str': 'new'},
+ {'score': 0.033413201570510864,
+  'sequence': "Hello I'm a model model.",
+  'token': 3299,
+  'token_str': 'model'},
+ {'score': 0.030217764899134636,
+  'sequence': "Hello I'm a French model.",
+  'token': 92265,
+  'token_str': 'French'},
+ {'score': 0.026436051353812218,
+  'sequence': "Hello I'm a sexy model.",
+  'token': 17473,
+  'token_str': 'sexy'}]
+```
+
+Here is how to use this model to get the features of a given text in PyTorch:
+
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
+model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")
+
+# prepare input
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='pt')
+
+# forward pass
+output = model(**encoded_input)
+```
+
+### BibTeX entry and citation info
+
+```bibtex
+@article{DBLP:journals/corr/abs-1911-02116,
+  author    = {Alexis Conneau and
+               Kartikay Khandelwal and
+               Naman Goyal and
+               Vishrav Chaudhary and
+               Guillaume Wenzek and
+               Francisco Guzm{\'{a}}n and
+               Edouard Grave and
+               Myle Ott and
+               Luke Zettlemoyer and
+               Veselin Stoyanov},
+  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
+  journal   = {CoRR},
+  volume    = {abs/1911.02116},
+  year      = {2019},
+  url       = {http://arxiv.org/abs/1911.02116},
+  eprinttype = {arXiv},
+  eprint    = {1911.02116},
+  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
+
+<a href="https://huggingface.co/exbert/?model=xlm-roberta-base">
+<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
+</a>
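The masked language modeling objective described in the xlm-roberta-large card can be illustrated with a toy sketch. This is not the fairseq or transformers implementation: the helper name `mask_tokens`, the whitespace tokenization, and the fixed seed are assumptions made for illustration. Real MLM training operates on subword ids rather than words.

```python
import random

MASK_TOKEN = "<mask>"  # mask string used in the card's fill-mask example


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy MLM example.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so only masked positions would contribute to a loss.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "Hello I'm a multilingual model trained on many languages".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Because each position is masked independently, a short sentence may end up with no masked token at all; the 15% figure only holds on average over a large corpus.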
tests/cards/yiyanghkust___finbert-tone.md
ADDED
@@ -0,0 +1,33 @@
+`FinBERT` is a BERT model pre-trained on financial communication text. The purpose is to enhance financial NLP research and practice. It is trained on the following three financial communication corpora. The total corpus size is 4.9B tokens.
+- Corporate Reports 10-K & 10-Q: 2.5B tokens
+- Earnings Call Transcripts: 1.3B tokens
+- Analyst Reports: 1.1B tokens
+
+More technical details on `FinBERT`: [Click Link](https://github.com/yya518/FinBERT)
+
+This released `finbert-tone` model is the `FinBERT` model fine-tuned on 10,000 manually annotated (positive, negative, neutral) sentences from analyst reports. This model achieves superior performance on the financial tone analysis task. If you are simply interested in using `FinBERT` for financial tone analysis, give it a try.
+
+If you use the model in your academic work, please cite the following paper:
+
+Huang, Allen H., Hui Wang, and Yi Yang. "FinBERT: A Large Language Model for Extracting Information from Financial Text." *Contemporary Accounting Research* (2022).
+
+
+# How to use
+You can use this model with the Transformers pipeline for sentiment analysis.
+```python
+from transformers import BertTokenizer, BertForSequenceClassification
+from transformers import pipeline
+
+finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
+tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
+
+nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer)
+
+sentences = ["there is a shortage of capital, and we need extra financing",
+             "growth is strong and we have plenty of liquidity",
+             "there are doubts about our finances",
+             "profits are flat"]
+results = nlp(sentences)
+print(results)  #LABEL_0: neutral; LABEL_1: positive; LABEL_2: negative
+
+```
tests/conftest.py
CHANGED
@@ -1,7 +1,62 @@
-import
-from
+from os import listdir
+from os.path import isfile, join
+from pathlib import Path
+
+
+# TODO: I have the option of maybe making a check for accuracy/metrics?
+
+# Intended Purpose, General Limitations, Computational Requirements
+expected_check_results = {
+    "albert-base-v2": [True, True, False],
+    "bert-base-cased": [True, True, False],
+    "bert-base-multilingual-cased": [True, True, False],
+    "bert-base-uncased": [True, True, False],
+    "cl-tohoku___bert-base-japanese-whole-word-masking": [False, False, False],
+    "distilbert-base-cased-distilled-squad": [True, True, True],
+    "distilbert-base-uncased": [True, True, False],
+    "distilbert-base-uncased-finetuned-sst-2-english": [True, True, False],
+    "distilroberta-base": [True, True, False],
+    "emilyalsentzer___Bio_ClinicalBERT": [False, False, False],
+    "facebook___bart-large-mnli": [False, False, False],
+    "google___electra-base-discriminator": [False, False, False],
+    "gpt2": [True, True, False],
+    "Helsinki-NLP___opus-mt-en-es": [False, False, False],
+    "jonatasgrosman___wav2vec2-large-xlsr-53-english": [False, False, False],
+    "microsoft___layoutlmv3-base": [True, False, False],
+    "openai___clip-vit-base-patch32": [True, True, False],
+    "openai___clip-vit-large-patch14": [True, True, False],
+    "philschmid___bart-large-cnn-samsum": [False, False, False],
+    "prajjwal1___bert-tiny": [False, False, False],
+    "roberta-base": [True, True, True],  # For the computational requirements, sort of?
+    "roberta-large": [True, True, True],
+    "runwayml___stable-diffusion-v1-5": [True, True, True],
+    "sentence-transformers___all-MiniLM-L6-v2": [True, False, False],
+    "StanfordAIMI___stanford-deidentifier-base": [False, False, False],
+    "t5-base": [True, False, False],
+    "t5-small": [True, False, False],
+    "xlm-roberta-base": [True, False, False],
+    "xlm-roberta-large": [True, False, False],
+    "yiyanghkust___finbert-tone": [True, False, False],
+}
+
+
+def pytest_generate_tests(metafunc):
+    if "real_model_card" in metafunc.fixturenames:
+        files = [f"cards/{f}" for f in listdir("cards") if isfile(join("cards", f))]
+        cards = [Path(f).read_text() for f in files]
+        model_ids = [f.replace("cards/", "").replace(".md", "") for f in files]
+
+        # TODO: IMPORTANT – remove the default [False, False, False]
+        expected_results = [expected_check_results.get(m, [False, False, False]) for m, c in zip(model_ids, cards)]
+
+        metafunc.parametrize(
+            ["real_model_card", "expected_check_results"],
+            list(map(list, zip(cards, expected_results)))
+        )
+
+    # rows = read_csvrows()
+    # if 'row' in metafunc.fixturenames:
+    #     metafunc.parametrize('row', rows)
+    # if 'col' in metafunc.fixturenames:
+    #     metafunc.parametrize('col', list(itertools.chain(*rows)))
 
-@pytest.fixture()
-def bloom_card():
-    # TODO: Note, this is a heavily doctored version of the card.
-    return bc
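The `conftest.py` hunk above discovers cards with `listdir("cards")`, which resolves against the current working directory. As a minimal sketch, assuming the cards sit in a directory whose path is passed in explicitly, the same discovery step could be written with `pathlib`. The helper name `collect_cards` is hypothetical and is not part of the commit.

```python
from pathlib import Path


def collect_cards(cards_dir):
    """Map model id -> card text for every .md file in cards_dir.

    Hypothetical helper: the model id is the file name without its
    extension, matching the keys of expected_check_results.
    """
    cards_dir = Path(cards_dir)
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(cards_dir.glob("*.md"))  # sorted for a stable test order
        if path.is_file()
    }
```

Inside `conftest.py` this might be called as `collect_cards(Path(__file__).parent / "cards")`, so the lookup no longer depends on where pytest is launched from. The resulting keys could also be passed to `metafunc.parametrize` through its `ids` argument, so each generated test is labelled with its model id.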
tests/test_compliance_checks.py
CHANGED
@@ -283,67 +283,13 @@ class TestComplianceSuite:
         check.run_check.assert_called_once()
 
 
-
-
-        (
-
-
-Some random info...
-
-## Model Details
-
-### Model Description
-
-- **Developed by:** Nima Boscarino
-- **Model type:** Yada yada yada
-
-## Uses
-
-### Direct Use
-
-Here is some info about direct uses...
-
-### Downstream Use [optional]
-
-[More Information Needed]
-
-### Out-of-Scope Use
-
-Here is some info about out-of-scope uses...
-
-## Bias, Risks, and Limitations
-
-Hello world! These are some risks...
-
-## Technical Specifications
-
-### Compute infrastructure
-Jean Zay Public Supercomputer, provided by the French government.
-
-#### Hardware
-
-* 384 A100 80GB GPUs (48 nodes)
-
-#### Software
-
-* Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
-</details>
-
-## More Things
-""", False),
-        ("bloom_card", True)
+def test_end_to_end_compliance_suite(real_model_card, expected_check_results):
+    suite = ComplianceSuite(checks=[
+        IntendedPurposeCheck(),
+        GeneralLimitationsCheck(),
+        ComputationalRequirementsCheck()
     ])
-    def test_end_to_end_compliance_suite(self, card, fixture, request):
-        if fixture:
-            card = request.getfixturevalue(card)
-
-        suite = ComplianceSuite(checks=[
-            ModelProviderIdentityCheck(),
-            IntendedPurposeCheck(),
-            GeneralLimitationsCheck(),
-            ComputationalRequirementsCheck()
-        ])
 
-
+    results = suite.run(real_model_card)
 
-
+    assert all([r.status == e for r, e in zip(results, expected_check_results)])
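The new test collapses all three checks into a single `all([...])`, so a failure does not say which check disagreed, and `zip` silently ignores a length mismatch. The sketch below shows one way to report mismatches per check. It assumes only that each result exposes a `status` attribute, as the test above does. `FakeResult` and `find_mismatches` are hypothetical names used for illustration, and the check names are taken from the comment in `conftest.py`.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Stand-in for a check result; only the `status` attribute is assumed."""
    status: bool


# Order follows the comment above expected_check_results in conftest.py.
CHECK_NAMES = ["Intended Purpose", "General Limitations", "Computational Requirements"]


def find_mismatches(results, expected, names=CHECK_NAMES):
    """Return one message per check whose status differs from the expectation."""
    if len(results) != len(expected):
        raise ValueError(f"got {len(results)} results for {len(expected)} expectations")
    return [
        f"{name}: expected {want}, got {res.status}"
        for name, res, want in zip(names, results, expected)
        if res.status != want
    ]


results = [FakeResult(True), FakeResult(False), FakeResult(False)]
print(find_mismatches(results, [True, True, False]))
# prints ['General Limitations: expected True, got False']
```

In the test body, `assert not find_mismatches(results, expected_check_results)` would then surface the disagreeing checks in the pytest failure message.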