remove img width
index.html  (+20 −39)

--- a/index.html
+++ b/index.html
@@ -151,11 +151,9 @@
 <style>
 .neighborhood-figure-container {grid-column: screen; width: 100%; margin: auto; margin-top: 30px; margin-bottom: 30px; padding-top: 20px; padding-bottom: 10px; border-bottom: 1px solid #EEE; border-top: 1px solid #EEE;}
 </style>
-<
-<
-
-</figure>
-</div>
+<figure class="l-body figure">
+    <img src="plots/fineweb-recipe.png"/>
+</figure>
 <h2>Starting point: text extraction</h2>
 <p>CommonCrawl data is available in two main formats: WARC
 and WET. <strong>WARC </strong>(Web ARChive format) files contain the raw data from the crawl, including the
@@ -175,8 +173,7 @@
 resulting dataset is considerably larger for the WET data (around 254BT), it proves to be of much worse
 quality than the one that used trafilatura to extract text from WARC files (which is around 200BT). Many of
 these additional tokens on the WET files are unnecessary page boilerplate.</p>
-<figure class="image"><a href="plots/wet_comparison.png"><img
-    style="width:640px" src="plots/wet_comparison.png"/></a></figure>
+<figure class="image"><a href="plots/wet_comparison.png"><img src="plots/wet_comparison.png"/></a></figure>

 <h2>Base filtering</h2>
 <p>Filtering is an important part of the curation process. It
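Context for the WARC/WET comparison in the hunk above: a minimal sketch of the two extraction paths, assuming `warcio` and `trafilatura` are available. File paths, options and record handling here are illustrative and not taken from the actual FineWeb pipeline.

```python
# Sketch: trafilatura extraction from a WARC file vs. the pre-extracted WET payload.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_from_warc(warc_path: str):
    """Yield trafilatura-extracted text for each HTML response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            # trafilatura strips menus, ads and other boilerplate from the raw HTML
            text = trafilatura.extract(html, include_comments=False)
            if text:
                yield text

def extract_from_wet(wet_path: str):
    """Yield the pre-extracted plain text stored in a WET file (no boilerplate removal)."""
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "conversion":  # WET text lives in 'conversion' records
                yield record.content_stream().read().decode("utf-8", errors="ignore")
```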
@@ -233,8 +230,7 @@
 comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
 buckets of 20 hashes (that requires a substantially larger amount of compute resources):</p>
 <figure class="image"><a
-    href="plots/minhash_parameters_comparison.png"><img
-    src="plots/minhash_parameters_comparison.png"/></a>
+    href="plots/minhash_parameters_comparison.png"><img src="plots/minhash_parameters_comparison.png"/></a>
 </figure>
 <p>While the high number of hash functions in RefinedWeb
 allows for a steeper, more well defined cut off, we believe the compute and storage savings are a reasonable
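For the MinHash configurations compared in the hunk above: with b buckets of r hashes each, two documents of true similarity s get flagged as duplicates with probability 1 − (1 − s^r)^b. A small sketch comparing the two setups; the 14×8 split of the 112 hashes is an assumption (it matches datatrove's default), not something stated in this diff.

```python
# Sketch: duplicate-detection probability curves for two MinHash-LSH configurations.
# P(flagged) = 1 - (1 - s**r)**b for similarity s, with b buckets of r hashes each.
# 14 x 8 = 112 hashes (assumed split); 450 x 20 = 9000 hashes (RefinedWeb setup).
def p_duplicate(s: float, buckets: int, hashes_per_bucket: int) -> float:
    return 1 - (1 - s ** hashes_per_bucket) ** buckets

for s in (0.5, 0.7, 0.75, 0.8, 0.9):
    small = p_duplicate(s, buckets=14, hashes_per_bucket=8)
    large = p_duplicate(s, buckets=450, hashes_per_bucket=20)
    print(f"s={s:.2f}  14x8: {small:.3f}  450x20: {large:.3f}")
```

The larger configuration gives a sharper cut-off around the similarity threshold, which is the trade-off the surrounding text describes.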
@@ -254,8 +250,7 @@
 trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
 tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
 green curve below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
-<figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img
-    style="width:576px" src="plots/dedup_all_dumps_bad.png"/></a></figure>
+<figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img src="plots/dedup_all_dumps_bad.png"/></a></figure>
 <p>This was quite puzzling as our intuition regarding web
 data was that more deduplication would always result in improved performance. We decided to take a closer
 look at one of the oldest dumps, dump 2013-48:</p>
@@ -281,8 +276,7 @@
 </li>
 </ul>
 <figure class="image"><a
-    href="plots/removed_data_cross_dedup.png"><img
-    src="plots/removed_data_cross_dedup.png"/></a></figure>
+    href="plots/removed_data_cross_dedup.png"><img src="plots/removed_data_cross_dedup.png"/></a></figure>
 <p>These results show that, for this older dump where we were
 removing over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
 removed (considered independently from all the other dumps).</p>
@@ -293,8 +287,7 @@
 <p>When training on a random sample from this dataset we see
 that it now matches RefinedWeb’s performance (blue and red curves below):</p>
 <figure class="image"><a
-    href="plots/cross_ind_unfiltered_comparison.png"><img
-    src="plots/cross_ind_unfiltered_comparison.png"/></a>
+    href="plots/cross_ind_unfiltered_comparison.png"><img src="plots/cross_ind_unfiltered_comparison.png"/></a>
 </figure>
 <p>We hypothesis that the main improvement gained from
 deduplication is the removal of very large clusters that are present in every single dump (you will find
@@ -342,8 +335,7 @@
 <p>We then simulated uniformly sampling documents from this
 entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
 below you can see how often each document would be repeated.</p>
-<figure class="image"><a href="plots/dedup_impact_simulation.png"><img
-    style="width:708px" src="plots/dedup_impact_simulation.png"/></a></figure>
+<figure class="image"><a href="plots/dedup_impact_simulation.png"><img src="plots/dedup_impact_simulation.png"/></a></figure>
 <p>For 1B almost all documents would be unique
 (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
 dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
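To make the sampling simulation in the hunk above concrete, a rough sketch under its stated assumption that every document appears 100 times in the 20T-token corpus: the number of copies of a given document that survive uniform sampling is approximately Binomial(100, f), where f is the sampled fraction. The document count and the token-level approximation below are my simplifications, not the post's exact setup.

```python
# Sketch: how often a document is repeated in uniformly sampled subsets of a corpus
# where each document has 100 copies (one per dump). Subset sizes follow the post.
import numpy as np

rng = np.random.default_rng(0)
corpus_tokens = 20_000_000_000_000
for subset in (1e9, 1e10, 1e11, 3.5e11, 1e12):
    f = subset / corpus_tokens
    copies = rng.binomial(100, f, size=1_000_000)  # simulate 1M distinct documents
    copies = copies[copies > 0]                    # only documents that made it into the subset
    dup_share = (copies > 1).mean() if copies.size else 0.0
    print(f"subset={subset:.0e} tokens  f={f:.2e}  sampled docs with >1 copy: {dup_share:.3%}")
```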
@@ -383,8 +375,7 @@
 <p>The performance of the models trained on each of these was
 consistently worse (even if to different degrees) than that of the original independently deduplicated
 data:</p>
-<figure class="image"><a href="plots/Untitled.png"><img
-    style="width:708px" src="plots/Untitled.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled.png"><img src="plots/Untitled.png"/></a></figure>
 <h2>Additional filtering</h2>
 <p>By this point we had reached the same performance as
 RefinedWeb, but on our aggregate of tasks, another heavily filtered dataset, <a
@@ -405,8 +396,7 @@
 <a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented applying
 each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump
 (plot smoothed with a 3 checkpoints sliding window):</p>
-<figure class="image"><a href="plots/c4_filters.png"><img
-    style="width:708px" src="plots/c4_filters.png"/></a></figure>
+<figure class="image"><a href="plots/c4_filters.png"><img src="plots/c4_filters.png"/></a></figure>
 <ul class="bulleted-list">
 <li style="list-style-type:disc">applying “All filters” (drop lines not ending on punctuation marks,
 mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
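As a reminder of what C4-style heuristics of the kind referenced in the hunk above look like in code, a loose sketch; the thresholds and exact rules are illustrative rather than the precise C4 values.

```python
# Sketch: C4-like heuristic filters (illustrative thresholds, not the exact C4 rules).
TERMINAL_PUNCT = (".", "!", "?", '"', "'")

def c4_like_filter(text: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Return True if the document would be kept."""
    if "lorem ipsum" in text.lower() or "{" in text:
        return False
    lines = [l for l in text.splitlines() if l.strip()]
    # drop lines not ending in terminal punctuation or mentioning javascript / cookie notices
    kept = [l for l in lines
            if l.rstrip().endswith(TERMINAL_PUNCT)
            and "javascript" not in l.lower()
            and "cookie" not in l.lower()]
    n_words = sum(len(l.split()) for l in kept)
    return bool(kept) and min_words <= n_words <= max_words
```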
@@ -454,8 +444,7 @@
 caused by lower quality data on the full dedup version, we inspected histograms and manually defined
 thresholds for the metrics where these differences were starker. This process yielded 17 candidate
 threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
-<figure class="image"><a href="plots/Untitled%201.png"><img
-    style="width:790px" src="plots/Untitled%201.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%201.png"><img src="plots/Untitled%201.png"/></a></figure>

 <p>To assess the effectiveness of these newly created
 filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
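A loose illustration of the histogram-and-threshold step mentioned in the hunk above; the metric and the cutoff below are invented for the example and are not one of the 17 actual threshold-filter pairs.

```python
# Sketch: compare a per-document metric between two corpora and apply a manually chosen cutoff.
# The metric (share of lines ending in punctuation) and the 0.12 threshold are illustrative only.
import numpy as np

def punct_line_ratio(text: str) -> float:
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return sum(l.rstrip().endswith((".", "!", "?")) for l in lines) / len(lines)

def histogram(corpus: list[str], bins: int = 20):
    # histogram of the metric, used to eyeball where the two corpora diverge
    return np.histogram([punct_line_ratio(doc) for doc in corpus], bins=bins, range=(0.0, 1.0))

def keep(doc: str, threshold: float = 0.12) -> bool:
    # documents below the manually chosen threshold are treated as low quality and dropped
    return punct_line_ratio(doc) >= threshold
```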
@@ -479,8 +468,7 @@
 <ul class="bulleted-list">
 <li style="list-style-type:disc">When applying the 3 together, ~22% of tokens were removed</li>
 </ul>
-<figure class="image"><a href="plots/Untitled%202.png"><img
-    style="width:708px" src="plots/Untitled%202.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%202.png"><img src="plots/Untitled%202.png"/></a></figure>
 <hr />
 <h1>The final dataset</h1>
 <p>The final FineWeb dataset comprises 15T tokens and
@@ -498,8 +486,7 @@
 <ul class="bulleted-list">
 <li style="list-style-type:disc">our custom filters (mentioned in the previous section)</li>
 </ul>
-<figure class="image"><a href="plots/fineweb_all_filters.png"><img
-    style="width:708px" src="plots/fineweb_all_filters.png"/></a></figure>
+<figure class="image"><a href="plots/fineweb_all_filters.png"><img src="plots/fineweb_all_filters.png"/></a></figure>
 <p>We compared 🍷 FineWeb with the following datasets:</p>
 <ul class="bulleted-list">
 <li style="list-style-type:disc"><a
@@ -533,12 +520,10 @@
 collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
 href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
 results here</a>.</p>
-<figure class="image"><a href="plots/fineweb_ablations.png"><img
-    style="width:708px" src="plots/fineweb_ablations.png"/></a></figure>
+<figure class="image"><a href="plots/fineweb_ablations.png"><img src="plots/fineweb_ablations.png"/></a></figure>
 <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
 FineWeb:</p>
-<figure class="image"><a href="plots/Untitled%203.png"><img
-    style="width:4587px" src="plots/Untitled%203.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%203.png"><img src="plots/Untitled%203.png"/></a></figure>
 <hr />
 <h1>Just like fine wine, not all crawls are created
 equal</h1>
@@ -549,8 +534,7 @@
 the last 3 checkpoints for both seeds and plotted the average of these 6 data points per dump. </p>
 <p>The plot below clearly shows that some dumps perform far
 worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
-<figure class="image"><a href="plots/score_by_dump.png"><img
-    style="width:708px" src="plots/score_by_dump.png"/></a></figure>
+<figure class="image"><a href="plots/score_by_dump.png"><img src="plots/score_by_dump.png"/></a></figure>
 <p>We identified 5 main relevant time intervals:</p>
 <ul class="bulleted-list">
 <li style="list-style-type:disc">2013 to 2016: relatively stable, average quality</li>
@@ -592,8 +576,7 @@
 crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding
 it, while a small value means that a considerable number of top 60k FQDNs were downsampled or removed, or
 that alternatively new FQDNs were added to the top 60k.</p>
-<figure class="image"><a href="plots/Untitled%204.png"><img
-    style="width:5026px" src="plots/Untitled%204.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%204.png"><img src="plots/Untitled%204.png"/></a></figure>
 <p>The data indicates three significant changes:
 2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
 <p>The explanation for the changes between 2022-33/2022-40
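For the FQDN-overlap metric discussed in the hunk above, a minimal sketch of how such a per-crawl number could be computed; the per-dump FQDN counts are assumed to be aggregated elsewhere, and the helper names are mine.

```python
# Sketch: overlap of the top-60k FQDNs between consecutive crawls.
# `fqdn_counts_by_dump` maps a dump id (e.g. "2021-43") to {fqdn: document_count}.
from urllib.parse import urlparse

def fqdn(url: str) -> str:
    # fully qualified domain name of a document URL
    return urlparse(url).netloc.lower()

def top_fqdns(counts: dict[str, int], k: int = 60_000) -> set[str]:
    return {name for name, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:k]}

def consecutive_overlap(fqdn_counts_by_dump: dict[str, dict[str, int]]) -> dict[str, float]:
    dumps = sorted(fqdn_counts_by_dump)
    overlap = {}
    for prev, curr in zip(dumps, dumps[1:]):
        a, b = top_fqdns(fqdn_counts_by_dump[prev]), top_fqdns(fqdn_counts_by_dump[curr])
        overlap[curr] = len(a & b) / len(b)  # share of current top-60k already in the previous dump
    return overlap
```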
@@ -625,8 +608,7 @@
 not contain any of these phrases), but assuming that the amount of synthetic data were to not change across
 dumps, one would expect these frequencies to remain approximately constant over time.</p>
 <p>The results are shown in the following graph:</p>
-<figure class="image"><a href="plots/Untitled%205.png"><img
-    style="width:4156px" src="plots/Untitled%205.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%205.png"><img src="plots/Untitled%205.png"/></a></figure>
 <p>While the frequency remained approximately constant until
 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
 in recent crawls, as the proxy metric also correlates well with the agg score, with a pearson correlation of
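For the synthetic-data proxy in the hunk above, a rough sketch of a per-dump phrase-frequency count; the phrase list is only an example of ChatGPT-style markers, since the exact list used in the post is not shown in this diff.

```python
# Sketch: per-crawl frequency of ChatGPT-style phrases, as a proxy for synthetic data.
# `documents_by_dump` maps dump id -> iterable of document texts; the phrases are illustrative.
PHRASES = ("as a language model", "as an ai language model", "i cannot fulfill")

def proxy_frequency(documents_by_dump: dict[str, list[str]]) -> dict[str, float]:
    freq = {}
    for dump, docs in documents_by_dump.items():
        hits = sum(any(p in doc.lower() for p in PHRASES) for doc in docs)
        freq[dump] = hits / max(len(docs), 1)  # share of documents containing at least one phrase
    return freq
```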
@@ -641,8 +623,7 @@
 evaluations, might have increased the contamination in recent benchmarks, explaining the score improvements
 of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>

-<figure class="image"><a href="plots/Untitled%206.png"><img
-    style="width:708px" src="plots/Untitled%206.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%206.png"><img src="plots/Untitled%206.png"/></a></figure>
 <hr />
 <h1>Next steps</h1>
 <p>We want to continue improving FineWeb and will also