victormiller committed on
Commit 66a1161 · verified · 1 Parent(s): 081c95c

Update web.py

Files changed (1)
  1. web.py +6 -7
web.py CHANGED
@@ -249,7 +249,7 @@ def web_data():
  P("This section provides a complete discussion on the filtering applied to the 99 Common Crawl snapshots that comprise the web data section of TxT360. The section is split into the following topic areas: "),
  Ul(
  Li("Web Data Processing Summary", style = "margin-bottom: 5px"),
- Li("Document Preperation", style = "margin-bottom: 5px"),
+ Li("Document Preparation", style = "margin-bottom: 5px"),
  Li("Line-Level Filtering", style = "margin-bottom: 5px"),
  Li("Local Deduplication", style = "margin-bottom: 5px"),
  Li("Each section is complete with code and comparisons to Dolma,", D_cite(bibtex_key="soldaini2024dolma"),
@@ -263,9 +263,9 @@ def web_data():
  H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
  P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
  table_div_filter_data,
- P("The table below provides a comparison of the quality filters that have been applied to each dataset. Of note, TxT360 does not use any machine learning (ML) based filters. ML filters are a useful and effecient filtering processing that should be consider for any filtering project. However, we are leaving that option to TxT360's end users."),
+ P("The table below provides a comparison of the quality filters that have been applied to each dataset. Of note, TxT360 does not use any machine learning (ML) based filters. ML filters are a useful and efficient filtering processing that should be consider for any filtering project. However, we are leaving this to future work."),
  table_div_qf_filter_data,
- P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
+ P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across snapshots. "),
  Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
  P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
  id="section2",),
@@ -278,9 +278,8 @@ def web_data():
  WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
  WET files contain plaintexts extracted by Common Crawl. In line with previous works""",D_cite(bibtex_key="thepile"),D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="gopher"),D_cite(bibtex_key="fineweb") ,""" ,
  we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
- Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
  """),
- P("We directly read WARC files instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),
+ P("We directly read WARC files with the warcio library instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),
 
  Details(
  Summary("Text Extraction Examples"),
@@ -338,7 +337,7 @@ def web_data():
 
  P(B("URL Filtering: "), """
  The following section details the decisions behind utilizing the UT1 blocklist. We chose to use the UT1 blocklist as a simple method for filtering
- out potentially harmful content such as adult content. We also excluded URLs that contained the digital version of the curated curated data (e.g. wikipedia.org) to avoid duplication.
+ out potentially harmful content such as adult content. We also excluded URLs that contained the digital version of the curated data (e.g. wikipedia.org) to avoid duplication.
  """),
 
  P(B("URL Blocklist: "), """
@@ -579,7 +578,7 @@ def web_data():
  work, """, D_cite(bibtex_key="gopher"), D_cite(bibtex_key="refinedweb"), D_cite(bibtex_key="dolma"), """ we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
  """),
  P(B("Fraction of Characters in Repeated Lines: "), """
- Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents containing mupltiple, short duplicate passages, as well as those with few,
+ Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents containing multiple, short duplicate passages, as well as those with few,
  but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
  that are duplicates, and the fraction of characters contained within those duplicated passages.
  """),
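
Note on the last hunk: the paragraph edited there describes computing, per document, the fraction of passages that are duplicates and the fraction of characters inside those duplicated passages. The sketch below illustrates that calculation at line level. It is not code from web.py or the TxT360 pipeline. The function names, the decision to count only the second and later occurrences of a line as duplicates, and the two threshold values are all assumptions made for illustration.

```python
from typing import Tuple


def repetition_fractions(text: str) -> Tuple[float, float]:
    """Return (duplicate-line fraction, duplicate-character fraction).

    Assumption: a line is a "duplicate" if an identical line (after
    stripping surrounding whitespace) already appeared earlier in the
    document. The first occurrence is not counted as a duplicate.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0, 0.0

    seen = set()
    dup_lines = 0
    dup_chars = 0
    for ln in lines:
        if ln in seen:
            dup_lines += 1
            dup_chars += len(ln)
        else:
            seen.add(ln)

    total_chars = sum(len(ln) for ln in lines)
    return dup_lines / len(lines), dup_chars / total_chars


def has_excessive_repetition(
    text: str,
    max_line_frac: float = 0.30,  # hypothetical threshold, not taken from web.py
    max_char_frac: float = 0.20,  # hypothetical threshold, not taken from web.py
) -> bool:
    """Flag a document when either fraction exceeds its threshold."""
    line_frac, char_frac = repetition_fractions(text)
    return line_frac > max_line_frac or char_frac > max_char_frac
```

Checking both fractions matters because they catch different cases: many short repeated lines push the line fraction up while contributing few characters, whereas a few long repeated lines do the opposite.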