victormiller committed on
Commit a810b7b · verified · 1 Parent(s): 03ad039

Update web.py

Files changed (1)
  1. web.py +17 -16
web.py CHANGED
@@ -432,7 +432,7 @@ def web_data():
  P("We directly read WARC files instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),

  Details(
- Summary("Open Me - WARC TEST"),
  DV2("data/sample_wet.json", "data/sample_warc.json", 3),
  ),
  #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
@@ -442,16 +442,16 @@ def web_data():
  After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
  This step removes over 60% of the whole data.
  """),
- DV(
- "data/sample_non_en.json",
- 3,
- "Sample documents that are classified as non-English",
- ),
- DV(
- "data/sample_en_low.json",
- 3,
- "Sample documents that are classified as English but with score less than 0.65",
- ),
  H4("1.3 URL Filtering"),
  P("""
  Following RefinedWeb [3], we use a manually inspected URL blocklist to filter fraudulent and/or adult websites.
@@ -463,6 +463,7 @@ def web_data():
  articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
  4.6M domain names in the UT1 blocklist. 24 URL domains were detected with more than 4k matches, which are shown below.
  """),
  DVS(urls_high_matches, "24 URL domains with more than 4k matches"),
  P("""
  We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
@@ -481,11 +482,11 @@ def web_data():
  non_web_urls,
  "curated url domains that are excluded from our dataset",
  ),
- DV(
- "data/sample_url_exclusion.json",
- 0,
- "Sample documents whose urls are in our curated url domain list",
- ),
  H3("2. Line-Level Removal"),
  P("""
  Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
 
  P("We directly read WARC files instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),

  Details(
+ Summary("Text Extraction Examples"),
  DV2("data/sample_wet.json", "data/sample_warc.json", 3),
  ),
  #DV2("data/sample_wet.json", "data/sample_warc.json", 3),
 
  After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
  This step removes over 60% of the whole data.
  """),
+ Details(
+ Summary("Sample documents that are classified as non-English"),
+ DV("data/sample_non_en.json", 3),
+ ),
+
+ Details(
+ Summary("Sample documents that are classified as English but with score less than 0.65"),
+ DV("data/sample_en_low.json",3),
+ ),
+
  H4("1.3 URL Filtering"),
  P("""
  Following RefinedWeb [3], we use a manually inspected URL blocklist to filter fraudulent and/or adult websites.
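
Note on the language-identification step quoted in this hunk: the text says non-English documents are filtered with a fastText language identifier at a threshold of 0.65. A minimal sketch of that rule is below. It is not the project's actual code. The function name `keep_english`, the injected `predict` callable, and the `__label__en` label string are assumptions for illustration; in practice `predict` would wrap a loaded fastText language-ID model.

```python
from typing import Callable, Iterable, Iterator, Tuple

# (language label, confidence score) as returned by a hypothetical wrapper
# around a language-ID model.
Prediction = Tuple[str, float]


def keep_english(
    docs: Iterable[str],
    predict: Callable[[str], Prediction],
    threshold: float = 0.65,             # threshold quoted in the page text
    english_label: str = "__label__en",  # assumed label format
) -> Iterator[str]:
    """Yield only documents predicted as English with score >= threshold."""
    for text in docs:
        # fastText predicts on a single line, so newlines are flattened first.
        label, score = predict(text.replace("\n", " "))
        if label == english_label and score >= threshold:
            yield text
```

Injecting `predict` keeps the thresholding logic testable without a model file. Documents labelled English but scoring below 0.65 are dropped, which matches the second sample viewer added in this commit.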
 
  articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
  4.6M domain names in the UT1 blocklist. 24 URL domains were detected with more than 4k matches, which are shown below.
  """),
+
  DVS(urls_high_matches, "24 URL domains with more than 4k matches"),
  P("""
  We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
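
Note on the URL-matching step quoted in this hunk: the text describes matching sampled URLs against blocklist domains, listing domains with more than 4k matches, and then manually removing some domains from the blocklist. A rough sketch of that bookkeeping follows. The function names, the `www.` normalization, and the exact-host matching rule are assumptions for illustration and may differ from what the project actually does.

```python
from collections import Counter
from urllib.parse import urlsplit


def blocked_domain_counts(urls, blocklist, allowlist=frozenset()):
    """Count how many URLs fall on each blocklisted domain.

    Domains in `allowlist` (e.g. ones removed after manual inspection)
    are excluded from the effective blocklist.
    """
    effective = set(blocklist) - set(allowlist)
    counts = Counter()
    for url in urls:
        # hostname is None for URLs without a network location.
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # assumed normalization, not from the source
        if host in effective:
            counts[host] += 1
    return counts


def high_match_domains(counts, min_matches=4000):
    """Return domains with strictly more than `min_matches` matches."""
    return {domain: n for domain, n in counts.items() if n > min_matches}
```

Domains returned by `high_match_domains` are the candidates for manual review. Any domain judged safe can then be passed in `allowlist` on the next run.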
 
  non_web_urls,
  "curated url domains that are excluded from our dataset",
  ),
+
+ Details(
+ Summary("Sample documents whose urls are in our curated url domain list"),
+ DV("data/sample_url_exclusion.json", 0,),
+ ),
  H3("2. Line-Level Removal"),
  P("""
  Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level