added ToC and other changes

Files changed: README.md (+0 −2), index.html (+280 −149)
README.md CHANGED
@@ -7,5 +7,3 @@ sdk: static
 pinned: false
 header: mini
 ---
-
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
index.html CHANGED
@@ -5,6 +5,111 @@
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta charset="utf8">
 <title>FineWeb: 15T tokens of high quality web data</title>
+<style>
+
+    /* ****************************************
+     * TOC
+     ******************************************/
+    @media (max-width: 1199px) {
+        d-contents {
+            display: none;
+            justify-self: start;
+            align-self: start;
+            padding-bottom: 0.5em;
+            margin-bottom: 1em;
+            padding-left: 0.25em;
+            border-bottom: 1px solid rgba(0, 0, 0, 0.1);
+            border-bottom-width: 1px;
+            border-bottom-style: solid;
+            border-bottom-color: rgba(0, 0, 0, 0.1);
+        }
+    }
+
+    d-contents a:hover {
+        border-bottom: none;
+    }
+
+
+    @media (min-width: 1200px) {
+        d-article {
+            /* Ensure d-article does not prevent sticky positioning */
+            overflow: visible;
+        }
+
+        d-contents {
+            align-self: start;
+            grid-column-start: 1 !important;
+            grid-column-end: 4 !important;
+            grid-row: auto / span 6;
+            justify-self: end;
+            margin-top: 0em;
+            padding-right: 3em;
+            padding-left: 2em;
+            border-right: 1px solid rgba(0, 0, 0, 0.1);
+            border-right-width: 1px;
+            border-right-style: solid;
+            border-right-color: rgba(0, 0, 0, 0.1);
+            position: -webkit-sticky; /* For Safari */
+            position: sticky;
+            top: 0; /* Adjust this value if needed */
+        }
+    }
+
+    d-contents nav h3 {
+        margin-top: 0;
+        margin-bottom: 1em;
+    }
+
+    d-contents nav div {
+        color: rgba(0, 0, 0, 0.8);
+        font-weight: bold;
+    }
+
+    d-contents nav a {
+        color: rgba(0, 0, 0, 0.8);
+        border-bottom: none;
+        text-decoration: none;
+    }
+
+    d-contents li {
+        list-style-type: none;
+    }
+
+    d-contents ul, d-article d-contents ul {
+        padding-left: 1em;
+    }
+
+    d-contents nav ul li {
+        margin-bottom: .25em;
+    }
+
+    d-contents nav a:hover {
+        text-decoration: underline solid rgba(0, 0, 0, 0.6);
+    }
+
+    d-contents nav ul {
+        margin-top: 0;
+        margin-bottom: 6px;
+    }
+
+
+    d-contents nav > div {
+        display: block;
+        outline: none;
+        margin-bottom: 0.5em;
+    }
+
+    d-contents nav > div > a {
+        font-size: 13px;
+        font-weight: 600;
+    }
+
+    d-contents nav > div > a:hover,
+    d-contents nav > ul > li > a:hover {
+        text-decoration: none;
+    }
+
+</style>
 </head>
 
 <body>
@@ -44,12 +149,12 @@
 <figure style="grid-column: page; mix-blend-mode: multiply;">
 <img src="banner.png" alt="FineWeb">
 </figure>
-<!-- <figure style="grid-column: page; margin: 1rem 0;"><img src="banner.png"-->
-<!-- style="width:100%; border: 1px solid rgba(0, 0, 0, 0.2);"/>-->
-<!-- </figure>-->
 </d-title>
 <d-byline></d-byline>
 <d-article>
+<d-contents>
+</d-contents>
+
 <p>We have recently released 🍷FineWeb, our new large scale
 (15T tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
@@ -59,19 +164,19 @@
 <p><strong>TLDR:</strong> This blog covers the FineWeb
 recipe, why more deduplication is not always better and some interesting findings on the difference in
 quality of CommonCrawl dumps.</p>
-
-<…
-<…
+
+<h2>Preamble</h2>
+<h3>Sourcing the data</h3>
 <p>A common question we see asked regarding web datasets used
 to train LLMs is “where do they even get all that data?” There are generally two options:</p>
-<ul…
-<li…
+<ul>
+<li>you either crawl it yourself, like <a
 href="https://platform.openai.com/docs/gptbot">OpenAI</a> or <a
 href="https://darkvisitors.com/agents/claudebot">Anthropic</a> seem to do
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>you use a public repository of crawled webpages, like the one maintained by
 the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
 </ul>
 <p>For FineWeb, similarly to what was done for a large number
@@ -81,7 +186,7 @@
 <p>As an example, their latest crawl (2024-10) contains 3.16
 billion web pages, totaling 424.7 TiB of uncompressed content (the size changes from dump to dump). There
 are 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format. </p>
-<…
+<h3>Processing at scale</h3>
 <p>Given the sheer size of the data involved, one of the main
 challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
 on our processing decisions and easily try out new ideas, while appropriately parallelizing our workloads
@@ -89,9 +194,9 @@
 <p>For this purpose, we developed <a
 href="https://github.com/huggingface/datatrove"><code>datatrove</code></a>, an open-source data
 processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
-CPU cores. All…
+CPU cores. All the data processing steps involved in the creation of FineWeb used this <a
 href="https://github.com/huggingface/datatrove">library</a>.</p>
-<…
+<h3>What is clean, good data?</h3>
 <p>This is probably the main question to keep in mind when
 creating a dataset. A good first lesson is that data that would intuitively be considered high quality by a
 human may not be necessarily the best data (or at least not all that you need) to train a good model on.</p>
@@ -127,14 +232,14 @@
 href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We tried selecting
 benchmarks that would provide good signal at a relatively small scale (small models trained on only a few
 billion tokens). Furthermore, we also used the following criteria when selecting benchmarks:</p>
-<ul…
-<li…
+<ul>
+<li>small variance between runs trained on different samplings of the same
 dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
 resulting scores to have as little noise as possible
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>performance increasing monotonically (or close) over a training run:
 ideally, as the number of seen tokens increases, the performance on this benchmark should not decrease
 (should not be too noisy)
 </li>
@@ -143,18 +248,14 @@
 href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>. To
 have results quickly we capped longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
 min on a single node of 8 GPUs - done in parallel to the training).</p>
-<…
-<h1>The FineWeb recipe</h1>
+<h2>The FineWeb recipe</h2>
 <p>In the next subsections we will explain each of the steps
 taken to produce the FineWeb dataset. You can find a full reproducible <code>datatrove</code> config <a
 href="https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py">here</a>.</p>
-<…
-.neighborhood-figure-container {grid-column: screen; width: 100%; margin: auto; margin-top: 30px; margin-bottom: 30px; padding-top: 20px; padding-bottom: 10px; border-bottom: 1px solid #EEE; border-top: 1px solid #EEE;}
-</style>
-<figure class="l-body figure">
+<figure class="l-body">
 <img src="plots/fineweb-recipe.png"/>
 </figure>
-<…
+<h3>Starting point: text extraction</h3>
 <p>CommonCrawl data is available in two main formats: WARC
 and WET. <strong>WARC </strong>(Web ARChive format) files contain the raw data from the crawl, including the
 full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
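The WARC-to-text step discussed in the surrounding hunks is easy to sketch. The following is an illustration only, not the datatrove pipeline; it assumes the published warcio and trafilatura packages and a locally downloaded WARC file:

```python
# Sketch: pull the HTML payload of each response record out of a
# CommonCrawl WARC file and strip boilerplate with trafilatura.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def iter_extracted_texts(warc_path: str):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)  # None when no main content is found
            if text:
                yield text
```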
@@ -173,38 +274,38 @@
 resulting dataset is considerably larger for the WET data (around 254BT), it proves to be of much worse
 quality than the one that used trafilatura to extract text from WARC files (which is around 200BT). Many of
 these additional tokens on the WET files are unnecessary page boilerplate.</p>
-<figure…
+<figure><img src="plots/wet_comparison.png"/></figure>
 
-<…
+<h3>Base filtering</h3>
 <p>Filtering is an important part of the curation process. It
 removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
 deemed to be “lower quality”.</p>
 <p>As a basis for our filtering we used part of the setup
 from <a href="https://arxiv.org/abs/2306.01116">RefinedWeb</a>. Namely, we:</p>
-<ul…
-<li…
+<ul>
+<li>Applied URL filtering using a <a
 href="https://dsi.ut-capitole.fr/blacklists/">blocklist</a> to remove adult content
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>Applied a <a
 href="https://fasttext.cc/docs/en/language-identification.html">fastText language classifier</a> to
 keep only English text with a score ≥ 0.65
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>Applied quality and repetition filters from the <a
 href="https://arxiv.org/abs/2112.11446">Gopher</a> paper (using the default thresholds)
 </li>
 </ul>
 <p>After applying this filtering to each of the text
 extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when
 tokenized with the <code>gpt2</code> tokenizer).</p>
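Of the three base filters added above, the language classification step is the simplest to illustrate. Below is a minimal sketch of a fastText English filter with the 0.65 threshold; it assumes fastText's published lid.176.bin language-ID model, and the helper name is ours, not datatrove's:

```python
# Sketch of the English-language filter: keep a document only when
# fastText's language identifier predicts English with score >= 0.65.
import fasttext

model = fasttext.load_model("lid.176.bin")  # https://fasttext.cc/docs/en/language-identification.html

def is_english(text: str, threshold: float = 0.65) -> bool:
    labels, scores = model.predict(text.replace("\n", " "))  # fastText expects a single line
    return labels[0] == "__label__en" and scores[0] >= threshold
```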
-<…
+<h3>Deduplication</h3>
 <p>Deduplication is another important step, especially for web
 datasets. Methods to deduplicate datasets attempt to remove redundant/repeated data. Deduplication is one of
 the most important steps when creating large web datasets for LLMs.</p>
-<…
+<h4>Why deduplicate?</h4>
 <p>The web has many aggregators, mirrors, templated pages or
 just otherwise repeated content spread over different domains and webpages. Often, these duplicated pages
 can be introduced by the crawler itself, when different links point to the same page. </p>
@@ -229,8 +330,7 @@
 92% and 98.8% respectively (<code>1-(1-s^8)^14</code>). See the plot below for a match probability
 comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
 buckets of 20 hashes (that requires a substantially larger amount of compute resources):</p>
-<figure…
-href="plots/minhash_parameters_comparison.png"><img src="plots/minhash_parameters_comparison.png"/></a>
+<figure><img src="plots/minhash_parameters_comparison.png"/>
 </figure>
 <p>While the high number of hash functions in RefinedWeb
 allows for a steeper, more well defined cut off, we believe the compute and storage savings are a reasonable
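The match probabilities quoted in this hunk follow directly from the closed form 1-(1-s^r)^b for b buckets of r hashes each. A few lines of Python reproduce the comparison between the 14x8 FineWeb setup and RefinedWeb's 450x20 (this computes the theoretical curve only, not the plotted ablation data):

```python
# Probability that two documents with true MinHash similarity s share
# at least one full bucket, for b buckets of r hashes each.
def match_probability(s: float, r: int, b: int) -> float:
    return 1 - (1 - s**r) ** b

for s in (0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9):
    fineweb = match_probability(s, r=8, b=14)       # 112 hashes in total
    refinedweb = match_probability(s, r=20, b=450)  # 9000 hashes in total
    print(f"s={s:.2f}  fineweb={fineweb:.3f}  refinedweb={refinedweb:.3f}")
```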
@@ -250,46 +350,44 @@
 trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
 tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
 green curve below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
-<figure…
+<figure><img src="plots/dedup_all_dumps_bad.png"/></figure>
 <p>This was quite puzzling as our intuition regarding web
 data was that more deduplication would always result in improved performance. We decided to take a closer
 look at one of the oldest dumps, dump 2013-48:</p>
-<ul…
-<li…
+<ul>
+<li>pre deduplication, this dump had ~490 billion tokens</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>after our iterative MinHash, ~31 billion tokens remained (94% of data
 removed)
 </li>
 </ul>
 <p>As an experiment, we tried training two models on 28BT
 sampled from the following data from 2013-48:</p>
-<ul…
-<li…
+<ul>
+<li>the fully deduplicated remaining ~31 billion tokens (<em>originally kept
 data</em>)
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>171 billion tokens obtained by individually deduplicating (without
 considering the other dumps) the ~460 billion tokens that had been removed from this dump in the
 iterative dedup process (<em>originally removed data</em>)
 </li>
 </ul>
-<figure…
-href="plots/removed_data_cross_dedup.png"><img src="plots/removed_data_cross_dedup.png"/></a></figure>
+<figure><img src="plots/removed_data_cross_dedup.png"/></figure>
 <p>These results show that, for this older dump where we were
 removing over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
-removed (considered independently…
+removed (considered independently of all the other dumps).</p>
 <h3>Taking a step back: individual dump dedup</h3>
 <p>We then tried an alternative approach: we deduplicated
 each dump with MinHash individually (without considering the other dumps). This resulted in 20 trillion
 tokens of data.</p>
 <p>When training on a random sample from this dataset we see
 that it now matches RefinedWeb’s performance (blue and red curves below):</p>
-<figure…
-href="plots/cross_ind_unfiltered_comparison.png"><img src="plots/cross_ind_unfiltered_comparison.png"/></a>
+<figure><img src="plots/cross_ind_unfiltered_comparison.png"/>
 </figure>
-<p>We…
+<p>We hypothesize that the main improvement gained from
 deduplication is the removal of very large clusters that are present in every single dump (you will find
 some examples of these clusters on the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
 documents) and that further deduplication for low number of deduplications (less than ~100 i.e. the number
|
|
302 |
improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
|
303 |
lower quality data. We also experimented with applying different, and often “lighter”, deduplication
|
304 |
approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
|
305 |
-
<
|
306 |
<p>Given the nature of deduplication, its effect is not
|
307 |
always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
|
308 |
filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
|
@@ -310,32 +408,32 @@
|
|
310 |
<p>To visualize the effect of scaling the number of training
|
311 |
tokens on measuring deduplication impact, we considered the following (very extreme and unrealistic
|
312 |
regarding the degree of duplication observed) theoretical scenario:</p>
|
313 |
-
<ul
|
314 |
-
<li
|
315 |
</ul>
|
316 |
-
<ul
|
317 |
-
<li
|
318 |
document in it is unique)
|
319 |
</li>
|
320 |
</ul>
|
321 |
-
<ul
|
322 |
-
<li
|
323 |
across dumps, effectively the worst case scenario)
|
324 |
</li>
|
325 |
</ul>
|
326 |
-
<ul
|
327 |
-
<li
|
328 |
size of our individual dedup above)
|
329 |
</li>
|
330 |
</ul>
|
331 |
-
<ul
|
332 |
-
<li
|
333 |
</li>
|
334 |
</ul>
|
335 |
<p>We then simulated uniformly sampling documents from this
|
336 |
entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
|
337 |
below you can see how often each document would be repeated.</p>
|
338 |
-
<figure
|
339 |
<p>For 1B almost all documents would be unique
|
340 |
(#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
|
341 |
dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
|
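The simulated distribution can also be approximated in closed form: when n documents are drawn uniformly from N unique documents, the copy count of a given document is roughly Poisson with rate n/N, and the share of sampled documents sitting in a cluster of k copies is the size-biased term e^(-λ)λ^(k-1)/(k-1)!. A sketch under the scenario's assumptions (this approximates the simulation rather than re-running it):

```python
# Size-biased Poisson approximation of the duplication simulation:
# probability that a sampled document has exactly k copies in the subset.
from math import exp, factorial

UNIQUE_DOCS = 200_000_000  # 200B tokens per dump / 1k tokens per document

for subset_tokens in (1e9, 1e10, 1e11, 3.5e11, 1e12):
    lam = (subset_tokens / 1_000) / UNIQUE_DOCS  # expected copies of each unique document
    dist = [exp(-lam) * lam ** (k - 1) / factorial(k - 1) for k in range(1, 9)]
    line = " ".join(f"k={k}:{p:.3f}" for k, p in enumerate(dist, start=1))
    print(f"{subset_tokens / 1e9:6.0f}B tokens  {line}")
```

At 1B tokens λ is about 0.005 and essentially every sampled document is unique; at 1T tokens λ=5 and the mass shifts to clusters of several copies, matching the figure.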
@@ -347,26 +445,26 @@
 documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
 measuring deduplication impact on the training of LLMs, once the biggest document clusters have been
 removed.</p>
-<…
+<h4>Other (failed) approaches</h4>
 <p>We attempted to improve the performance of the
 independently minhash deduped 20T of data by further deduplicating it with the following methods:</p>
-<ul…
-<li…
+<ul>
+<li>URL deduplication, where we only kept one document per normalized
 (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
 </ul>
-<ul…
-<li…
-<ul…
-<li…
+<ul>
+<li>Line deduplication:
+<ul>
+<li>remove all but 1 occurrence of each duplicated line (77.8% of
 tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>same as above, but only removing duplicate lines with at least 10
 words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens
 dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>remove all but 1 occurrence of each span of 3 duplicated lines
 with all numbers replaced by 0 (80.9% of tokens removed, 3.7T left) — <em>FineWeb 3-line
 dedup</em></li>
 </ul>
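To make the line-level variants concrete, here is a rough sketch of the simplest one, removing all but the first occurrence of each duplicated line, using an exact global hash set; the real runs were distributed across shards and the function name is illustrative:

```python
# Sketch of "FineWeb line dedup": keep only the first occurrence of
# every line seen anywhere in the corpus.
import hashlib

def dedup_lines(documents):
    seen = set()
    for doc in documents:
        kept = []
        for line in doc.split("\n"):
            digest = hashlib.sha1(line.encode("utf-8")).digest()
            if digest in seen:
                continue  # this exact line already appeared earlier in the corpus
            seen.add(digest)
            kept.append(line)
        yield "\n".join(kept)
```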
@@ -375,8 +473,8 @@
 <p>The performance of the models trained on each of these was
 consistently worse (even if to different degrees) than that of the original independently deduplicated
 data:</p>
-<figure…
-<…
+<figure><img src="plots/Untitled.png"/></figure>
+<h3>Additional filtering</h3>
 <p>By this point we had reached the same performance as
 RefinedWeb, but on our aggregate of tasks, another heavily filtered dataset, <a
 href="https://arxiv.org/abs/1910.10683">the C4 dataset</a>, still showed stronger performance (with
@@ -384,7 +482,7 @@
 <p>We therefore set out to find new filtering steps that
 would, at first, allow us to match the performance of C4 and eventually surpass it. A natural starting point
 was to look into the processing of C4 itself.</p>
-<…
+<h4>C4: A dataset that has stood the test of time</h4>
 <p>The <a href="https://huggingface.co/datasets/c4">C4
 dataset</a> was first released in 2019. It was obtained from the <code>2019-18</code> CommonCrawl dump by
 removing non english data, applying some heuristic filters on both the line and document level,
|
|
396 |
<a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented applying
|
397 |
each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump
|
398 |
(plot smoothed with a 3 checkpoints sliding window):</p>
|
399 |
-
<figure
|
400 |
-
<ul
|
401 |
-
<li
|
402 |
mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
|
403 |
ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (purple versus
|
404 |
pink curves).
|
405 |
</li>
|
406 |
</ul>
|
407 |
-
<ul
|
408 |
-
<li
|
409 |
boost, removing 2.8% and 4.3% of tokens, respectively
|
410 |
</li>
|
411 |
</ul>
|
412 |
-
<ul
|
413 |
-
<li
|
414 |
boost, but removes <em>around 30%</em> of all tokens (!)
|
415 |
</li>
|
416 |
</ul>
|
417 |
-
<ul
|
418 |
-
<li
|
419 |
training tokens, so we did not train on them individually
|
420 |
</li>
|
421 |
</ul>
|
422 |
-
<ul
|
423 |
-
<li
|
424 |
terminal_punct by itself, while removing less in total (~7%)
|
425 |
</li>
|
426 |
</ul>
|
427 |
<p>We decided to apply all C4 filters mentioned above except
|
428 |
the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
|
429 |
the next section.</p>
|
430 |
-
<
|
431 |
<p>To come up with new possible filtering rules, we collected
|
432 |
a very large list of statistics (statistical metrics) — over <strong>50</strong> — from different reference
|
433 |
datasets (C4, RefinedWeb, etc) and from a select list of our processed dumps, on both the independently
|
@@ -444,73 +542,72 @@
|
|
444 |
caused by lower quality data on the full dedup version, we inspected histograms and manually defined
|
445 |
thresholds for the metrics where these differences were starker. This process yielded 17 candidate
|
446 |
threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
|
447 |
-
<figure
|
448 |
|
449 |
<p>To assess the effectiveness of these newly created
|
450 |
filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
|
451 |
of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
|
452 |
the most significant improvements on the aggregate score:</p>
|
453 |
-
<ul
|
454 |
-
<li
|
455 |
(10.14% of tokens removed) — vs the 30% from the original C4 terminal punct filter
|
456 |
</li>
|
457 |
</ul>
|
458 |
-
<ul
|
459 |
-
<li
|
460 |
(12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2
|
461 |
</li>
|
462 |
</ul>
|
463 |
-
<ul
|
464 |
-
<li
|
465 |
0.67 (3.73% of tokens removed)
|
466 |
</li>
|
467 |
</ul>
|
468 |
-
<ul
|
469 |
-
<li
|
470 |
</ul>
|
471 |
-
<figure
|
472 |
-
<
|
473 |
-
<h1>The final dataset</h1>
|
474 |
<p>The final FineWeb dataset comprises 15T tokens and
|
475 |
includes the following previously mentioned steps, in order, each providing a performance boost on our group
|
476 |
of benchmark tasks:</p>
|
477 |
-
<ul
|
478 |
-
<li
|
479 |
</ul>
|
480 |
-
<ul
|
481 |
-
<li
|
482 |
</ul>
|
483 |
-
<ul
|
484 |
-
<li
|
485 |
</ul>
|
486 |
-
<ul
|
487 |
-
<li
|
488 |
</ul>
|
489 |
-
<figure
|
490 |
<p>We compared 🍷 FineWeb with the following datasets:</p>
|
491 |
-
<ul
|
492 |
-
<li
|
493 |
href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a>
|
494 |
</li>
|
495 |
</ul>
|
496 |
-
<ul
|
497 |
-
<li
|
498 |
</ul>
|
499 |
-
<ul
|
500 |
-
<li
|
501 |
CommonCrawl part)
|
502 |
</li>
|
503 |
</ul>
|
504 |
-
<ul
|
505 |
-
<li
|
506 |
</ul>
|
507 |
-
<ul
|
508 |
-
<li
|
509 |
href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a>
|
510 |
</li>
|
511 |
</ul>
|
512 |
-
<ul
|
513 |
-
<li
|
514 |
href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a>
|
515 |
(deduplicated)
|
516 |
</li>
|
@@ -520,13 +617,12 @@
|
|
520 |
collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
|
521 |
href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
|
522 |
results here</a>.</p>
|
523 |
-
<figure
|
524 |
<p>Some histogram comparisons of C4, Dolma, RefinedWeb and
|
525 |
FineWeb:</p>
|
526 |
-
<figure
|
527 |
-
<
|
528 |
-
|
529 |
-
equal</h1>
|
530 |
<p>During our ablation runs, we observed that certain crawls
|
531 |
outperformed others by a significant margin. To investigate this phenomenon, we conducted 27B token runs for
|
532 |
each dump (we used the version with base filtering + ind dedup), with 2 trainings per dump, where each used
|
@@ -534,24 +630,24 @@
 the last 3 checkpoints for both seeds and plotted the average of these 6 data points per dump. </p>
 <p>The plot below clearly shows that some dumps perform far
 worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
-<figure…
+<figure><img src="plots/score_by_dump.png"/></figure>
 <p>We identified 5 main relevant time intervals:</p>
-<ul…
-<li…
+<ul>
+<li>2013 to 2016: relatively stable, average quality</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>2017 to 2018: high quality, with a drop by the end of 2018</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>2019 to 2021: high quality, steadily increasing</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>2021-49 and 2022: very large drop in performance, followed by worse quality
 dumps
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>2023 and 2024-10: almost exponential improvement. In particular, 2023-50
 and 2024-10 are by far the best dumps
 </li>
 </ul>
@@ -559,14 +655,14 @@
 models on < 15T would be to train on FineWeb while excluding the worst quality CommonCrawl dumps.</p>
 <p>We conducted further analysis to investigate the factors
 causing these differences from dump to dump. In particular, we considered 3 potential causes: </p>
-<ul…
-<li…
+<ul>
+<li>large sudden changes in the list of crawled URLs;</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>synthetic (LLM generated) data;</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>benchmark contamination;</li>
 </ul>
 <p>We go over each one in the following sections.</p>
 <h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3>
@@ -576,7 +672,7 @@
 crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding
 it, while a small value means that a considerable number of top 60k FQDNs were downsampled or removed, or,
 alternatively, that new FQDNs were added to the top 60k.</p>
-<figure…
+<figure><img src="plots/Untitled%204.png"/></figure>
 <p>The data indicates three significant changes:
 2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
 <p>The explanation for the changes between 2022-33/2022-40
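The overlap metric itself is straightforward to compute once per-crawl URL lists are available. A sketch with our own function names; collecting the URL lists from the crawl index is assumed to happen upstream, and URLs are assumed to include a scheme:

```python
# Sketch: fraction of a crawl's top-60k fully qualified domain names
# (FQDNs) that also appear in the preceding crawl's top-60k.
from collections import Counter
from urllib.parse import urlsplit

def top_fqdns(urls, k=60_000):
    counts = Counter(urlsplit(u).hostname for u in urls)
    return {fqdn for fqdn, _ in counts.most_common(k)}

def fqdn_overlap(prev_crawl_urls, curr_crawl_urls, k=60_000):
    return len(top_fqdns(prev_crawl_urls, k) & top_fqdns(curr_crawl_urls, k)) / k
```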
@@ -608,7 +704,7 @@
 not contain any of these phrases), but assuming that the amount of synthetic data were not to change across
 dumps, one would expect these frequencies to remain approximately constant over time.</p>
 <p>The results are shown in the following graph:</p>
-<figure…
+<figure><img src="plots/Untitled%205.png"/></figure>
 <p>While the frequency remained approximately constant until
 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
 in recent crawls, but the proxy metric also correlates well with the agg score, with a Pearson correlation of
|
|
623 |
evaluations, might have increased the contamination in recent benchmarks, explaining the score improvements
|
624 |
of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
|
625 |
|
626 |
-
<figure
|
627 |
-
<
|
628 |
-
<h1>Next steps</h1>
|
629 |
<p>We want to continue improving FineWeb and will also
|
630 |
release a technical report with more details soon.</p>
|
631 |
<p>Adapting the FineWeb recipe [wip]</p>
|
@@ -640,4 +735,40 @@
|
|
640 |
|
641 |
<d-bibliography src="bibliography.bib"></d-bibliography>
|
642 |
</d-appendix>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
643 |
</body>
|