hynky (HF staff) committed
Commit 548ee1d · 1 Parent(s): 2322af0

remove img width

Files changed (1): index.html (+20 -39)
index.html CHANGED
@@ -151,11 +151,9 @@
  <style>
  .neighborhood-figure-container {grid-column: screen; width: 100%; margin: auto; margin-top: 30px; margin-bottom: 30px; padding-top: 20px; padding-bottom: 10px; border-bottom: 1px solid #EEE; border-top: 1px solid #EEE;}
  </style>
- <div class="neighborhood-figure-container">
- <figure class="image">
- <img style="width:708px" src="plots/fineweb-recipe.png"/>
- </figure>
- </div>
+ <figure class="l-body figure">
+ <img src="plots/fineweb-recipe.png"/>
+ </figure>
  <h2>Starting point: text extraction</h2>
  <p>CommonCrawl data is available in two main formats: WARC
  and WET. <strong>WARC </strong>(Web ARChive format) files contain the raw data from the crawl, including the
@@ -175,8 +173,7 @@
  resulting dataset is considerably larger for the WET data (around 254BT), it proves to be of much worse
  quality than the one that used trafilatura to extract text from WARC files (which is around 200BT). Many of
  these additional tokens on the WET files are unnecessary page boilerplate.</p>
- <figure class="image"><a href="plots/wet_comparison.png"><img
- style="width:640px" src="plots/wet_comparison.png"/></a></figure>
+ <figure class="image"><a href="plots/wet_comparison.png"><img src="plots/wet_comparison.png"/></a></figure>
 
  <h2>Base filtering</h2>
  <p>Filtering is an important part of the curation process. It
@@ -233,8 +230,7 @@
  comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
  buckets of 20 hashes (that requires a substantially larger amount of compute resources):</p>
  <figure class="image"><a
- href="plots/minhash_parameters_comparison.png"><img style="width:567px"
- src="plots/minhash_parameters_comparison.png"/></a>
+ href="plots/minhash_parameters_comparison.png"><img src="plots/minhash_parameters_comparison.png"/></a>
  </figure>
  <p>While the high number of hash functions in RefinedWeb
  allows for a steeper, more well defined cut off, we believe the compute and storage savings are a reasonable
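The trade-off in that minhash comparison can be reproduced from the standard LSH detection-probability formula: with b buckets of r hashes each, a pair of documents with Jaccard similarity s lands in at least one common bucket with probability 1 - (1 - s^r)^b. A minimal sketch follows; the 14 x 8 split of the 112 hashes is an assumption for illustration (the diff only quotes the total), while 450 x 20 is the RefinedWeb setup quoted above.

```python
# Probability that a MinHash-LSH setup with `buckets` buckets of
# `hashes_per_bucket` hashes each flags a pair with Jaccard similarity s:
#   P(flagged) = 1 - (1 - s**r) ** b
def detection_probability(s: float, buckets: int, hashes_per_bucket: int) -> float:
    return 1.0 - (1.0 - s ** hashes_per_bucket) ** buckets

# 112 hashes (assumed split: 14 buckets x 8 hashes) vs RefinedWeb's 9000 (450 x 20).
configs = {"112 hashes (14 x 8, assumed)": (14, 8), "9000 hashes (450 x 20)": (450, 20)}
for name, (b, r) in configs.items():
    curve = ", ".join(f"s={s/10:.1f}: {detection_probability(s/10, b, r):.3f}" for s in range(5, 10))
    print(f"{name}: {curve}")
```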
@@ -254,8 +250,7 @@
  trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
  tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
  green curve below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
- <figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img
- style="width:576px" src="plots/dedup_all_dumps_bad.png"/></a></figure>
+ <figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img src="plots/dedup_all_dumps_bad.png"/></a></figure>
  <p>This was quite puzzling as our intuition regarding web
  data was that more deduplication would always result in improved performance. We decided to take a closer
  look at one of the oldest dumps, dump 2013-48:</p>
@@ -281,8 +276,7 @@
  </li>
  </ul>
  <figure class="image"><a
- href="plots/removed_data_cross_dedup.png"><img style="width:576px"
- src="plots/removed_data_cross_dedup.png"/></a></figure>
+ href="plots/removed_data_cross_dedup.png"><img src="plots/removed_data_cross_dedup.png"/></a></figure>
  <p>These results show that, for this older dump where we were
  removing over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
  removed (considered independently from all the other dumps).</p>
@@ -293,8 +287,7 @@
  <p>When training on a random sample from this dataset we see
  that it now matches RefinedWeb’s performance (blue and red curves below):</p>
  <figure class="image"><a
- href="plots/cross_ind_unfiltered_comparison.png"><img style="width:576px"
- src="plots/cross_ind_unfiltered_comparison.png"/></a>
+ href="plots/cross_ind_unfiltered_comparison.png"><img src="plots/cross_ind_unfiltered_comparison.png"/></a>
  </figure>
  <p>We hypothesis that the main improvement gained from
  deduplication is the removal of very large clusters that are present in every single dump (you will find
@@ -342,8 +335,7 @@
  <p>We then simulated uniformly sampling documents from this
  entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
  below you can see how often each document would be repeated.</p>
- <figure class="image"><a href="plots/dedup_impact_simulation.png"><img
- style="width:708px" src="plots/dedup_impact_simulation.png"/></a></figure>
+ <figure class="image"><a href="plots/dedup_impact_simulation.png"><img src="plots/dedup_impact_simulation.png"/></a></figure>
  <p>For 1B almost all documents would be unique
  (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
  dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
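The sampling simulation described in that hunk can be approximated analytically: if every document appears 100 times in the 20 trillion token dataset and each copy survives a uniform subsample of size N with probability p = N / 20T, the number of surviving copies of a document is Binomial(100, p). A rough sketch under that assumption (not the authors' actual simulation code):

```python
# Rough analytic version of the subsampling simulation: a 20T-token dataset in
# which every document appears 100 times (once per dump), uniformly subsampled
# to a smaller budget. Each copy survives with probability p = subset / total,
# so the surviving copy count per document is Binomial(100, p).
TOTAL_TOKENS, COPIES_PER_DOC = 20_000_000_000_000, 100

def share_of_unique_docs(subset_tokens: float) -> float:
    """P(#duplicates = 1) among documents that appear at least once in the subset."""
    p = subset_tokens / TOTAL_TOKENS
    p_exactly_one = COPIES_PER_DOC * p * (1 - p) ** (COPIES_PER_DOC - 1)
    p_at_least_one = 1 - (1 - p) ** COPIES_PER_DOC
    return p_exactly_one / p_at_least_one

for subset in (1e9, 10e9, 100e9, 350e9, 1e12):
    print(f"{subset / 1e9:>5.0f}B tokens -> unique share ≈ {share_of_unique_docs(subset):.3f}")
```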
@@ -383,8 +375,7 @@
  <p>The performance of the models trained on each of these was
  consistently worse (even if to different degrees) than that of the original independently deduplicated
  data:</p>
- <figure class="image"><a href="plots/Untitled.png"><img
- style="width:708px" src="plots/Untitled.png"/></a></figure>
+ <figure class="image"><a href="plots/Untitled.png"><img src="plots/Untitled.png"/></a></figure>
  <h2>Additional filtering</h2>
  <p>By this point we had reached the same performance as
  RefinedWeb, but on our aggregate of tasks, another heavily filtered dataset, <a
@@ -405,8 +396,7 @@
  <a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented applying
  each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump
  (plot smoothed with a 3 checkpoints sliding window):</p>
- <figure class="image"><a href="plots/c4_filters.png"><img
- style="width:708px" src="plots/c4_filters.png"/></a></figure>
+ <figure class="image"><a href="plots/c4_filters.png"><img src="plots/c4_filters.png"/></a></figure>
  <ul class="bulleted-list">
  <li style="list-style-type:disc">applying “All filters” (drop lines not ending on punctuation marks,
  mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
@@ -454,8 +444,7 @@
  caused by lower quality data on the full dedup version, we inspected histograms and manually defined
  thresholds for the metrics where these differences were starker. This process yielded 17 candidate
  threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
- <figure class="image"><a href="plots/Untitled%201.png"><img
- style="width:790px" src="plots/Untitled%201.png"/></a></figure>
+ <figure class="image"><a href="plots/Untitled%201.png"><img src="plots/Untitled%201.png"/></a></figure>
 
  <p>To assess the effectiveness of these newly created
  filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
@@ -479,8 +468,7 @@
  <ul class="bulleted-list">
  <li style="list-style-type:disc">When applying the 3 together, ~22% of tokens were removed</li>
  </ul>
- <figure class="image"><a href="plots/Untitled%202.png"><img
- style="width:708px" src="plots/Untitled%202.png"/></a></figure>
+ <figure class="image"><a href="plots/Untitled%202.png"><img src="plots/Untitled%202.png"/></a></figure>
  <hr />
  <h1>The final dataset</h1>
  <p>The final FineWeb dataset comprises 15T tokens and
@@ -498,8 +486,7 @@
  <ul class="bulleted-list">
  <li style="list-style-type:disc">our custom filters (mentioned in the previous section)</li>
  </ul>
- <figure class="image"><a href="plots/fineweb_all_filters.png"><img
- style="width:708px" src="plots/fineweb_all_filters.png"/></a></figure>
+ <figure class="image"><a href="plots/fineweb_all_filters.png"><img src="plots/fineweb_all_filters.png"/></a></figure>
  <p>We compared 🍷 FineWeb with the following datasets:</p>
  <ul class="bulleted-list">
  <li style="list-style-type:disc"><a
@@ -533,12 +520,10 @@
  collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
  results here</a>.</p>
- <figure class="image"><a href="plots/fineweb_ablations.png"><img
- style="width:708px" src="plots/fineweb_ablations.png"/></a></figure>
+ <figure class="image"><a href="plots/fineweb_ablations.png"><img src="plots/fineweb_ablations.png"/></a></figure>
  <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
  FineWeb:</p>
- <figure class="image"><a href="plots/Untitled%203.png"><img
- style="width:4587px" src="plots/Untitled%203.png"/></a></figure>
+ <figure class="image"><a href="plots/Untitled%203.png"><img src="plots/Untitled%203.png"/></a></figure>
  <hr />
  <h1>Just like fine wine, not all crawls are created
  equal</h1>
@@ -549,8 +534,7 @@
  the last 3 checkpoints for both seeds and plotted the average of these 6 data points per dump. </p>
  <p>The plot below clearly shows that some dumps perform far
  worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
- <figure class="image"><a href="plots/score_by_dump.png"><img
- style="width:708px" src="plots/score_by_dump.png"/></a></figure>
+ <figure class="image"><a href="plots/score_by_dump.png"><img src="plots/score_by_dump.png"/></a></figure>
  <p>We identified 5 main relevant time intervals:</p>
  <ul class="bulleted-list">
  <li style="list-style-type:disc">2013 to 2016: relatively stable, average quality</li>
@@ -592,8 +576,7 @@
  crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding
  it, while a small value means that a considerable number of top 60k FQDNs were downsampled or removed, or
  that alternatively new FQDNs were added to the top 60k.</p>
- <figure class="image"><a href="plots/Untitled%204.png"><img
- style="width:5026px" src="plots/Untitled%204.png"/></a></figure>
+ <figure class="image"><a href="plots/Untitled%204.png"><img src="plots/Untitled%204.png"/></a></figure>
  <p>The data indicates three significant changes:
  2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
  <p>The explanation for the changes between 2022-33/2022-40
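The crawl-to-crawl similarity metric described in that hunk (overlap of the top 60k FQDNs between consecutive dumps) could be computed along these lines; the `fqdn_doc_counts` input and the ranking by document count are illustrative assumptions rather than the authors' exact procedure:

```python
# Sketch of the crawl-to-crawl similarity measure: the fraction of a dump's
# top-60k FQDNs shared with the dump immediately preceding it. The ranking by
# document count and the `fqdn_doc_counts` input are illustrative assumptions.
from collections import Counter

TOP_K = 60_000

def top_fqdns(doc_counts: Counter, k: int = TOP_K) -> set:
    return {fqdn for fqdn, _ in doc_counts.most_common(k)}

def overlap_with_previous(fqdn_doc_counts: dict) -> dict:
    dumps = sorted(fqdn_doc_counts)  # "YYYY-WW" dump names sort chronologically
    overlaps = {}
    for prev, curr in zip(dumps, dumps[1:]):
        shared = top_fqdns(fqdn_doc_counts[prev]) & top_fqdns(fqdn_doc_counts[curr])
        overlaps[curr] = len(shared) / TOP_K  # ~1.0 = stable top-60k, lower = churn
    return overlaps
```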
@@ -625,8 +608,7 @@
  not contain any of these phrases), but assuming that the amount of synthetic data were to not change across
  dumps, one would expect these frequencies to remain approximately constant over time.</p>
  <p>The results are shown in the following graph:</p>
- <figure class="image"><a href="plots/Untitled%205.png"><img
- style="width:4156px" src="plots/Untitled%205.png"/></a></figure>
+ <figure class="image"><a href="plots/Untitled%205.png"><img src="plots/Untitled%205.png"/></a></figure>
  <p>While the frequency remained approximately constant until
  2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
  in recent crawls, as the proxy metric also correlates well with the agg score, with a pearson correlation of
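The synthetic-data proxy discussed in that hunk (per-dump frequency of ChatGPT-style phrases, correlated with the aggregate score) might look roughly like the sketch below; the phrase list and the per-dump numbers are placeholders, since the diff does not show the actual phrases or values:

```python
# Sketch of the synthetic-data proxy: per dump, the fraction of documents that
# contain any of a set of ChatGPT-style phrases, correlated with the per-dump
# aggregate score. The phrases and the numbers below are placeholders only.
from statistics import correlation  # Pearson correlation, Python >= 3.10

PHRASES = ("as an ai language model", "i cannot fulfill")  # hypothetical examples

def phrase_frequency(documents) -> float:
    hits = sum(any(p in doc.lower() for p in PHRASES) for doc in documents)
    return hits / max(len(documents), 1)

# Toy per-dump values purely to make the snippet runnable.
proxy_freq = {"2023-06": 1.0e-4, "2023-14": 4.0e-4, "2023-23": 9.0e-4}
agg_score = {"2023-06": 0.355, "2023-14": 0.360, "2023-23": 0.368}
dumps = sorted(proxy_freq)
r = correlation([proxy_freq[d] for d in dumps], [agg_score[d] for d in dumps])
print(f"pearson r between proxy frequency and agg score: {r:.2f}")
```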
@@ -641,8 +623,7 @@
  evaluations, might have increased the contamination in recent benchmarks, explaining the score improvements
  of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
 
- <figure class="image"><a href="plots/Untitled%206.png"><img
- style="width:708px" src="plots/Untitled%206.png"/></a></figure>
+ <figure class="image"><a href="plots/Untitled%206.png"><img src="plots/Untitled%206.png"/></a></figure>
  <hr />
  <h1>Next steps</h1>
  <p>We want to continue improving FineWeb and will also
 