remove img width
index.html  (+20 −39)

--- a/index.html
+++ b/index.html
@@ -151,11 +151,9 @@
 <style>
 .neighborhood-figure-container {grid-column: screen; width: 100%; margin: auto; margin-top: 30px; margin-bottom: 30px; padding-top: 20px; padding-bottom: 10px; border-bottom: 1px solid #EEE; border-top: 1px solid #EEE;}
 </style>
-<
-<
-
-</figure>
-</div>
+<figure class="l-body figure">
+    <img src="plots/fineweb-recipe.png"/>
+</figure>
 <h2>Starting point: text extraction</h2>
 <p>CommonCrawl data is available in two main formats: WARC
 and WET. <strong>WARC </strong>(Web ARChive format) files contain the raw data from the crawl, including the
@@ -175,8 +173,7 @@
 resulting dataset is considerably larger for the WET data (around 254BT), it proves to be of much worse
 quality than the one that used trafilatura to extract text from WARC files (which is around 200BT). Many of
 these additional tokens on the WET files are unnecessary page boilerplate.</p>
-<figure class="image"><a href="plots/wet_comparison.png"><img
-    style="width:640px" src="plots/wet_comparison.png"/></a></figure>
+<figure class="image"><a href="plots/wet_comparison.png"><img src="plots/wet_comparison.png"/></a></figure>

 <h2>Base filtering</h2>
 <p>Filtering is an important part of the curation process. It
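Context for the WARC/WET comparison in the hunk above: a minimal sketch of the two extraction paths, assuming `warcio` and `trafilatura` are available. File paths, options and record handling here are illustrative and not taken from the actual FineWeb pipeline.

```python
# Sketch: trafilatura extraction from a WARC file vs. the pre-extracted WET payload.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_from_warc(warc_path: str):
    """Yield trafilatura-extracted text for each HTML response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            # trafilatura strips menus, ads and other boilerplate from the raw HTML
            text = trafilatura.extract(html, include_comments=False)
            if text:
                yield text

def extract_from_wet(wet_path: str):
    """Yield the pre-extracted plain text stored in a WET file (no boilerplate removal)."""
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "conversion":  # WET text lives in 'conversion' records
                yield record.content_stream().read().decode("utf-8", errors="ignore")
```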
@@ -233,8 +230,7 @@
 comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
 buckets of 20 hashes (that requires a substantially larger amount of compute resources):</p>
 <figure class="image"><a
-    href="plots/minhash_parameters_comparison.png"><img
-    src="plots/minhash_parameters_comparison.png"/></a>
+    href="plots/minhash_parameters_comparison.png"><img src="plots/minhash_parameters_comparison.png"/></a>
 </figure>
 <p>While the high number of hash functions in RefinedWeb
 allows for a steeper, more well defined cut off, we believe the compute and storage savings are a reasonable
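For the MinHash configurations compared in the hunk above: with b buckets of r hashes each, two documents of true similarity s get flagged as duplicates with probability 1 − (1 − s^r)^b. A small sketch comparing the two setups; the 14×8 split of the 112 hashes is an assumption (it matches datatrove's default), not something stated in this diff.

```python
# Sketch: duplicate-detection probability curves for two MinHash-LSH configurations.
# P(flagged) = 1 - (1 - s**r)**b for similarity s, with b buckets of r hashes each.
# 14 x 8 = 112 hashes (assumed split); 450 x 20 = 9000 hashes (RefinedWeb setup).
def p_duplicate(s: float, buckets: int, hashes_per_bucket: int) -> float:
    return 1 - (1 - s ** hashes_per_bucket) ** buckets

for s in (0.5, 0.7, 0.75, 0.8, 0.9):
    small = p_duplicate(s, buckets=14, hashes_per_bucket=8)
    large = p_duplicate(s, buckets=450, hashes_per_bucket=20)
    print(f"s={s:.2f}  14x8: {small:.3f}  450x20: {large:.3f}")
```

The larger configuration gives a sharper cut-off around the similarity threshold, which is the trade-off the surrounding text describes.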
@@ -254,8 +250,7 @@
 trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
 tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
 green curve below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
-<figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img
-    style="width:576px" src="plots/dedup_all_dumps_bad.png"/></a></figure>
+<figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img src="plots/dedup_all_dumps_bad.png"/></a></figure>
 <p>This was quite puzzling as our intuition regarding web
 data was that more deduplication would always result in improved performance. We decided to take a closer
 look at one of the oldest dumps, dump 2013-48:</p>
@@ -281,8 +276,7 @@
 </li>
 </ul>
 <figure class="image"><a
-    href="plots/removed_data_cross_dedup.png"><img
-    src="plots/removed_data_cross_dedup.png"/></a></figure>
+    href="plots/removed_data_cross_dedup.png"><img src="plots/removed_data_cross_dedup.png"/></a></figure>
 <p>These results show that, for this older dump where we were
 removing over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
 removed (considered independently from all the other dumps).</p>
@@ -293,8 +287,7 @@
 <p>When training on a random sample from this dataset we see
 that it now matches RefinedWeb’s performance (blue and red curves below):</p>
 <figure class="image"><a
-    href="plots/cross_ind_unfiltered_comparison.png"><img
-    src="plots/cross_ind_unfiltered_comparison.png"/></a>
+    href="plots/cross_ind_unfiltered_comparison.png"><img src="plots/cross_ind_unfiltered_comparison.png"/></a>
 </figure>
 <p>We hypothesis that the main improvement gained from
 deduplication is the removal of very large clusters that are present in every single dump (you will find
@@ -342,8 +335,7 @@
 <p>We then simulated uniformly sampling documents from this
 entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
 below you can see how often each document would be repeated.</p>
-<figure class="image"><a href="plots/dedup_impact_simulation.png"><img
-    style="width:708px" src="plots/dedup_impact_simulation.png"/></a></figure>
+<figure class="image"><a href="plots/dedup_impact_simulation.png"><img src="plots/dedup_impact_simulation.png"/></a></figure>
 <p>For 1B almost all documents would be unique
 (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
 dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
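To make the sampling simulation in the hunk above concrete, a rough sketch under its stated assumption that every document appears 100 times in the 20T-token corpus: the number of copies of a given document that survive uniform sampling is approximately Binomial(100, f), where f is the sampled fraction. The document count and the token-level approximation below are my simplifications, not the post's exact setup.

```python
# Sketch: how often a document is repeated in uniformly sampled subsets of a corpus
# where each document has 100 copies (one per dump). Subset sizes follow the post.
import numpy as np

rng = np.random.default_rng(0)
corpus_tokens = 20_000_000_000_000
for subset in (1e9, 1e10, 1e11, 3.5e11, 1e12):
    f = subset / corpus_tokens
    copies = rng.binomial(100, f, size=1_000_000)  # simulate 1M distinct documents
    copies = copies[copies > 0]                    # only documents that made it into the subset
    dup_share = (copies > 1).mean() if copies.size else 0.0
    print(f"subset={subset:.0e} tokens  f={f:.2e}  sampled docs with >1 copy: {dup_share:.3%}")
```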
@@ -383,8 +375,7 @@
 <p>The performance of the models trained on each of these was
 consistently worse (even if to different degrees) than that of the original independently deduplicated
 data:</p>
-<figure class="image"><a href="plots/Untitled.png"><img
-    style="width:708px" src="plots/Untitled.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled.png"><img src="plots/Untitled.png"/></a></figure>
 <h2>Additional filtering</h2>
 <p>By this point we had reached the same performance as
 RefinedWeb, but on our aggregate of tasks, another heavily filtered dataset, <a
@@ -405,8 +396,7 @@
 <a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented applying
 each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump
 (plot smoothed with a 3 checkpoints sliding window):</p>
-<figure class="image"><a href="plots/c4_filters.png"><img
-    style="width:708px" src="plots/c4_filters.png"/></a></figure>
+<figure class="image"><a href="plots/c4_filters.png"><img src="plots/c4_filters.png"/></a></figure>
 <ul class="bulleted-list">
 <li style="list-style-type:disc">applying “All filters” (drop lines not ending on punctuation marks,
 mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
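As a reminder of what C4-style heuristics of the kind referenced in the hunk above look like in code, a loose sketch; the thresholds and exact rules are illustrative rather than the precise C4 values.

```python
# Sketch: C4-like heuristic filters (illustrative thresholds, not the exact C4 rules).
TERMINAL_PUNCT = (".", "!", "?", '"', "'")

def c4_like_filter(text: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Return True if the document would be kept."""
    if "lorem ipsum" in text.lower() or "{" in text:
        return False
    lines = [l for l in text.splitlines() if l.strip()]
    # drop lines not ending in terminal punctuation or mentioning javascript / cookie notices
    kept = [l for l in lines
            if l.rstrip().endswith(TERMINAL_PUNCT)
            and "javascript" not in l.lower()
            and "cookie" not in l.lower()]
    n_words = sum(len(l.split()) for l in kept)
    return bool(kept) and min_words <= n_words <= max_words
```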
@@ -454,8 +444,7 @@
 caused by lower quality data on the full dedup version, we inspected histograms and manually defined
 thresholds for the metrics where these differences were starker. This process yielded 17 candidate
 threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
-<figure class="image"><a href="plots/Untitled%201.png"><img
-    style="width:790px" src="plots/Untitled%201.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%201.png"><img src="plots/Untitled%201.png"/></a></figure>

 <p>To assess the effectiveness of these newly created
 filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
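A loose illustration of the histogram-and-threshold step mentioned in the hunk above; the metric and the cutoff below are invented for the example and are not one of the 17 actual threshold-filter pairs.

```python
# Sketch: compare a per-document metric between two corpora and apply a manually chosen cutoff.
# The metric (share of lines ending in punctuation) and the 0.12 threshold are illustrative only.
import numpy as np

def punct_line_ratio(text: str) -> float:
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return sum(l.rstrip().endswith((".", "!", "?")) for l in lines) / len(lines)

def histogram(corpus: list[str], bins: int = 20):
    # histogram of the metric, used to eyeball where the two corpora diverge
    return np.histogram([punct_line_ratio(doc) for doc in corpus], bins=bins, range=(0.0, 1.0))

def keep(doc: str, threshold: float = 0.12) -> bool:
    # documents below the manually chosen threshold are treated as low quality and dropped
    return punct_line_ratio(doc) >= threshold
```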
@@ -479,8 +468,7 @@
 <ul class="bulleted-list">
 <li style="list-style-type:disc">When applying the 3 together, ~22% of tokens were removed</li>
 </ul>
-<figure class="image"><a href="plots/Untitled%202.png"><img
-    style="width:708px" src="plots/Untitled%202.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%202.png"><img src="plots/Untitled%202.png"/></a></figure>
 <hr />
 <h1>The final dataset</h1>
 <p>The final FineWeb dataset comprises 15T tokens and
@@ -498,8 +486,7 @@
 <ul class="bulleted-list">
 <li style="list-style-type:disc">our custom filters (mentioned in the previous section)</li>
 </ul>
-<figure class="image"><a href="plots/fineweb_all_filters.png"><img
-    style="width:708px" src="plots/fineweb_all_filters.png"/></a></figure>
+<figure class="image"><a href="plots/fineweb_all_filters.png"><img src="plots/fineweb_all_filters.png"/></a></figure>
 <p>We compared 🍷 FineWeb with the following datasets:</p>
 <ul class="bulleted-list">
 <li style="list-style-type:disc"><a
@@ -533,12 +520,10 @@
 collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
 href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
 results here</a>.</p>
-<figure class="image"><a href="plots/fineweb_ablations.png"><img
-    style="width:708px" src="plots/fineweb_ablations.png"/></a></figure>
+<figure class="image"><a href="plots/fineweb_ablations.png"><img src="plots/fineweb_ablations.png"/></a></figure>
 <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
 FineWeb:</p>
-<figure class="image"><a href="plots/Untitled%203.png"><img
-    style="width:4587px" src="plots/Untitled%203.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%203.png"><img src="plots/Untitled%203.png"/></a></figure>
 <hr />
 <h1>Just like fine wine, not all crawls are created
 equal</h1>
@@ -549,8 +534,7 @@
 the last 3 checkpoints for both seeds and plotted the average of these 6 data points per dump. </p>
 <p>The plot below clearly shows that some dumps perform far
 worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
-<figure class="image"><a href="plots/score_by_dump.png"><img
-    style="width:708px" src="plots/score_by_dump.png"/></a></figure>
+<figure class="image"><a href="plots/score_by_dump.png"><img src="plots/score_by_dump.png"/></a></figure>
 <p>We identified 5 main relevant time intervals:</p>
 <ul class="bulleted-list">
 <li style="list-style-type:disc">2013 to 2016: relatively stable, average quality</li>
@@ -592,8 +576,7 @@
 crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding
 it, while a small value means that a considerable number of top 60k FQDNs were downsampled or removed, or
 that alternatively new FQDNs were added to the top 60k.</p>
-<figure class="image"><a href="plots/Untitled%204.png"><img
-    style="width:5026px" src="plots/Untitled%204.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%204.png"><img src="plots/Untitled%204.png"/></a></figure>
 <p>The data indicates three significant changes:
 2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
 <p>The explanation for the changes between 2022-33/2022-40
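For the FQDN-overlap metric discussed in the hunk above, a minimal sketch of how such a per-crawl number could be computed; the per-dump FQDN counts are assumed to be aggregated elsewhere, and the helper names are mine.

```python
# Sketch: overlap of the top-60k FQDNs between consecutive crawls.
# `fqdn_counts_by_dump` maps a dump id (e.g. "2021-43") to {fqdn: document_count}.
from urllib.parse import urlparse

def fqdn(url: str) -> str:
    # fully qualified domain name of a document URL
    return urlparse(url).netloc.lower()

def top_fqdns(counts: dict[str, int], k: int = 60_000) -> set[str]:
    return {name for name, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:k]}

def consecutive_overlap(fqdn_counts_by_dump: dict[str, dict[str, int]]) -> dict[str, float]:
    dumps = sorted(fqdn_counts_by_dump)
    overlap = {}
    for prev, curr in zip(dumps, dumps[1:]):
        a, b = top_fqdns(fqdn_counts_by_dump[prev]), top_fqdns(fqdn_counts_by_dump[curr])
        overlap[curr] = len(a & b) / len(b)  # share of current top-60k already in the previous dump
    return overlap
```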
@@ -625,8 +608,7 @@
 not contain any of these phrases), but assuming that the amount of synthetic data were to not change across
 dumps, one would expect these frequencies to remain approximately constant over time.</p>
 <p>The results are shown in the following graph:</p>
-<figure class="image"><a href="plots/Untitled%205.png"><img
-    style="width:4156px" src="plots/Untitled%205.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%205.png"><img src="plots/Untitled%205.png"/></a></figure>
 <p>While the frequency remained approximately constant until
 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
 in recent crawls, as the proxy metric also correlates well with the agg score, with a pearson correlation of
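For the synthetic-data proxy in the hunk above, a rough sketch of a per-dump phrase-frequency count; the phrase list is only an example of ChatGPT-style markers, since the exact list used in the post is not shown in this diff.

```python
# Sketch: per-crawl frequency of ChatGPT-style phrases, as a proxy for synthetic data.
# `documents_by_dump` maps dump id -> iterable of document texts; the phrases are illustrative.
PHRASES = ("as a language model", "as an ai language model", "i cannot fulfill")

def proxy_frequency(documents_by_dump: dict[str, list[str]]) -> dict[str, float]:
    freq = {}
    for dump, docs in documents_by_dump.items():
        hits = sum(any(p in doc.lower() for p in PHRASES) for doc in docs)
        freq[dump] = hits / max(len(docs), 1)  # share of documents containing at least one phrase
    return freq
```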
@@ -641,8 +623,7 @@
 evaluations, might have increased the contamination in recent benchmarks, explaining the score improvements
 of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>

-<figure class="image"><a href="plots/Untitled%206.png"><img
-    style="width:708px" src="plots/Untitled%206.png"/></a></figure>
+<figure class="image"><a href="plots/Untitled%206.png"><img src="plots/Untitled%206.png"/></a></figure>
 <hr />
 <h1>Next steps</h1>
 <p>We want to continue improving FineWeb and will also