Update README.md
README.md
CHANGED
@@ -88,10 +88,10 @@ Roller et al. (2021)
 - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News
 dataset that was used in RoBERTa (Liu et al., 2019b)
 
-
+The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally
 to each dataset’s size in the pretraining corpus.
 
-
+The dataset might contain offensive content as parts of the dataset are a subset of
 public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
 that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.
 
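
Note on the added text: the README says the 200MB validation split was sampled proportionally to each dataset's size in the pretraining corpus. The Python below is a minimal sketch of what such an allocation could look like; it is not the authors' actual procedure. The function name `allocate_validation_budget` and the per-dataset sizes are hypothetical placeholders. Only the 800GB total and the 200MB budget come from the README, and the sketch assumes decimal units (1GB = 1000MB).

```python
from typing import Dict


def allocate_validation_budget(
    dataset_sizes_mb: Dict[str, float], budget_mb: float
) -> Dict[str, float]:
    """Split a validation budget across datasets in proportion to their sizes."""
    total = sum(dataset_sizes_mb.values())
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    # Each dataset receives the same fraction of the budget as its
    # fraction of the whole corpus.
    return {
        name: budget_mb * size / total
        for name, size in dataset_sizes_mb.items()
    }


# Hypothetical per-dataset sizes (made up for illustration); they sum to
# 800,000 MB to match the 800GB total stated in the README.
example_sizes_mb = {"dataset_a": 500_000, "dataset_b": 300_000}

# 200MB validation budget, as stated in the README.
example_allocation = allocate_validation_budget(example_sizes_mb, budget_mb=200)

for name, share in example_allocation.items():
    print(f"{name}: {share:.1f} MB of validation data")
```

Under this scheme a dataset that makes up a given fraction of the corpus contributes the same fraction of the validation split, so the validation data mirrors the composition of the pretraining corpus.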