facebook/opt-350m

@@ -88,10 +88,10 @@ Roller et al. (2021)
   - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News
 dataset that was used in RoBERTa (Liu et al., 2019b)
-* The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally
+The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally
 to each dataset’s size in the pretraining corpus.
-* The dataset might contains offensive content as parts of the dataset are a subset of
+The dataset might contains offensive content as parts of the dataset are a subset of
 public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
 that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.