guipenedo HF staff committed on
Commit
b3970b3
·
1 Parent(s): 548ee1d

added ToC and other changes

Files changed (2)
  1. README.md +0 -2
  2. index.html +280 -149
README.md CHANGED
@@ -7,5 +7,3 @@ sdk: static
7
  pinned: false
8
  header: mini
9
  ---
10
-
11
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
7
  pinned: false
8
  header: mini
9
  ---
 
 
index.html CHANGED
@@ -5,6 +5,111 @@
5
  <meta name="viewport" content="width=device-width, initial-scale=1">
6
  <meta charset="utf8">
7
  <title>FineWeb: 15T tokens of high quality web data</title>
8
  </head>
9
 
10
  <body>
@@ -44,12 +149,12 @@
44
  <figure style="grid-column: page; mix-blend-mode: multiply;">
45
  <img src="banner.png" alt="FineWeb">
46
  </figure>
47
- <!-- <figure style="grid-column: page; margin: 1rem 0;"><img src="banner.png"-->
48
- <!-- style="width:100%; border: 1px solid rgba(0, 0, 0, 0.2);"/>-->
49
- <!-- </figure>-->
50
  </d-title>
51
  <d-byline></d-byline>
52
  <d-article>
53
  <p>We have recently released 🍷 FineWeb, our new large-scale
54
  (15T tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
55
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
@@ -59,19 +164,19 @@
59
  <p><strong>TLDR:</strong> This blog covers the FineWeb
60
  recipe, why more deduplication is not always better, and some interesting findings on the difference in
61
  quality of CommonCrawl dumps.</p>
62
- <hr/>
63
- <h1>Preamble</h1>
64
- <h2>Sourcing the data</h2>
65
  <p>A common question we see asked regarding web datasets used
66
  to train LLMs is “where do they even get all that data?” There are generally two options:</p>
67
- <ul class="bulleted-list">
68
- <li style="list-style-type:disc">you either crawl it yourself, like <a
69
  href="https://platform.openai.com/docs/gptbot">OpenAI</a> or <a
70
  href="https://darkvisitors.com/agents/claudebot">Anthropic</a> seem to do
71
  </li>
72
  </ul>
73
- <ul class="bulleted-list">
74
- <li style="list-style-type:disc">you use a public repository of crawled webpages, like the one maintained by
75
  the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
76
  </ul>
77
  <p>For FineWeb, similarly to what was done for a large number
@@ -81,7 +186,7 @@
81
  <p>As an example, their latest crawl (2024-10) contains 3.16
82
  billion web pages, totaling 424.7 TiB of uncompressed content (the size changes from dump to dump). There
83
  are 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format. </p>
84
- <h2>Processing at scale</h2>
85
  <p>Given the sheer size of the data involved, one of the main
86
  challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
87
  on our processing decisions and easily try out new ideas, while appropriately parallelizing our workloads
@@ -89,9 +194,9 @@
89
  <p>For this purpose, we developed <a
90
  href="https://github.com/huggingface/datatrove"><code>datatrove</code></a>, an open-source data
91
  processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
92
- CPU cores. All of the data processing steps involved in the creation of FineWeb used this <a
93
  href="https://github.com/huggingface/datatrove">library</a>.</p>
94
- <h2>What is clean, good data?</h2>
95
  <p>This is probably the main question to keep in mind when
96
  creating a dataset. A good first lesson is that data that would intuitively be considered high quality by a
97
  human may not necessarily be the best data (or at least not all that you need) to train a good model on.</p>
@@ -127,14 +232,14 @@
127
  href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We tried selecting
128
  benchmarks that would provide good signal at a relatively small scale (small models trained on only a few
129
  billion tokens). Furthermore, we also used the following criteria when selecting benchmarks:</p>
130
- <ul class="bulleted-list">
131
- <li style="list-style-type:disc">small variance between runs trained on different samplings of the same
132
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
133
  resulting scores to have as little noise as possible
134
  </li>
135
  </ul>
136
- <ul class="bulleted-list">
137
- <li style="list-style-type:disc">performance increasing monotonically (or close) over a training run:
138
  ideally, as the number of seen tokens increases, the performance on this benchmark should not decrease
139
  (should not be too noisy)
140
  </li>
@@ -143,18 +248,14 @@
143
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>. To
144
  get results quickly, we capped longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
145
  min on a single node of 8 GPUs, done in parallel with the training).</p>
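As a hypothetical illustration of how these two criteria can be checked over checkpoint scores (the numbers below are made up, and this is not our evaluation code):

```python
# Hypothetical check of the two benchmark-selection criteria: low variance
# across data-sampling seeds, and a (near-)monotonic trend over training.
import numpy as np

# scores[seed, checkpoint]: benchmark accuracy for 2 seeds x 5 checkpoints
scores = np.array([
    [0.31, 0.33, 0.34, 0.36, 0.37],
    [0.30, 0.32, 0.35, 0.35, 0.38],
])

noise = scores.std(axis=0).mean()  # criterion 1: run-to-run variance
steps = np.arange(scores.shape[1])
trend = np.mean([np.corrcoef(steps, run)[0, 1] for run in scores])  # criterion 2

print(f"mean std across seeds: {noise:.3f}, trend correlation: {trend:.2f}")
```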
146
- <hr />
147
- <h1>The FineWeb recipe</h1>
148
  <p>In the next subsections we will explain each of the steps
149
  taken to produce the FineWeb dataset. You can find a full reproducible <code>datatrove</code> config <a
150
  href="https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py">here</a>.</p>
151
- <style>
152
- .neighborhood-figure-container {grid-column: screen; width: 100%; margin: auto; margin-top: 30px; margin-bottom: 30px; padding-top: 20px; padding-bottom: 10px; border-bottom: 1px solid #EEE; border-top: 1px solid #EEE;}
153
- </style>
154
- <figure class="l-body figure">
155
  <img src="plots/fineweb-recipe.png"/>
156
  </figure>
157
- <h2>Starting point: text extraction</h2>
158
  <p>CommonCrawl data is available in two main formats: WARC
159
  and WET. <strong>WARC </strong>(Web ARChive format) files contain the raw data from the crawl, including the
160
  full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
@@ -173,38 +274,38 @@
173
  resulting dataset is considerably larger for the WET data (around 254BT), it proves to be of much worse
174
  quality than the one that used trafilatura to extract text from WARC files (which is around 200BT). Many of
175
  these additional tokens on the WET files are unnecessary page boilerplate.</p>
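For reference, trafilatura extraction is a single call; the HTML string below is a toy stand-in for a page read from a WARC record:

```python
# Text extraction with trafilatura; the HTML string stands in for a page
# read out of a WARC record.
import trafilatura

html = "<html><body><nav>site menu</nav><article><p>The actual article text.</p></article></body></html>"
text = trafilatura.extract(html)  # main content, or None if extraction fails
print(text)
```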
176
- <figure class="image"><a href="plots/wet_comparison.png"><img src="plots/wet_comparison.png"/></a></figure>
177
 
178
- <h2>Base filtering</h2>
179
  <p>Filtering is an important part of the curation process. It
180
  removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
181
  deemed to be “lower quality”.</p>
182
  <p>As a basis for our filtering we used part of the setup
183
  from <a href="https://arxiv.org/abs/2306.01116">RefinedWeb</a>. Namely, we:</p>
184
- <ul class="bulleted-list">
185
- <li style="list-style-type:disc">Applied URL filtering using a <a
186
  href="https://dsi.ut-capitole.fr/blacklists/">blocklist</a> to remove adult content
187
  </li>
188
  </ul>
189
- <ul class="bulleted-list">
190
- <li style="list-style-type:disc">Applied a <a
191
  href="https://fasttext.cc/docs/en/language-identification.html">fastText language classifier</a> to
192
  keep only English text with a score ≥ 0.65
193
  </li>
194
  </ul>
195
- <ul class="bulleted-list">
196
- <li style="list-style-type:disc">Applied quality and repetition filters from the <a
197
  href="https://arxiv.org/abs/2112.11446">Gopher</a> paper (using the default thresholds)
198
  </li>
199
  </ul>
200
  <p>After applying this filtering to each of the text
201
  extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when
202
  tokenized with the <code>gpt2</code> tokenizer).</p>
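As a sketch of the language-identification step, using the pretrained fastText lid.176.bin model with the 0.65 threshold quoted above (the model file must be downloaded separately):

```python
# fastText language filter sketch: keep English documents scoring >= 0.65.
# Requires lid.176.bin from
# https://fasttext.cc/docs/en/language-identification.html
import fasttext

model = fasttext.load_model("lid.176.bin")

def keep_document(text, threshold=0.65):
    labels, scores = model.predict(text.replace("\n", " "))  # predict() rejects newlines
    return labels[0] == "__label__en" and scores[0] >= threshold

print(keep_document("The quick brown fox jumps over the lazy dog."))
```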
203
- <h2>Deduplication</h2>
204
  <p>Deduplication is another key step, especially for web
205
  datasets. Deduplication methods attempt to remove redundant/repeated data; this is one of
206
  the most important steps when creating large web datasets for LLMs.</p>
207
- <h3>Why deduplicate?</h3>
208
  <p>The web has many aggregators, mirrors, templated pages or
209
  just otherwise repeated content spread over different domains and webpages. Often, these duplicated pages
210
  can be introduced by the crawler itself, when different links point to the same page. </p>
@@ -229,8 +330,7 @@
229
  92% and 98.8% respectively (<code>1-(1-s^8)^14</code>). See the plot below for a match probability
230
  comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
231
  buckets of 20 hashes (that requires a substantially larger amount of compute resources):</p>
232
- <figure class="image"><a
233
- href="plots/minhash_parameters_comparison.png"><img src="plots/minhash_parameters_comparison.png"/></a>
234
  </figure>
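The quoted percentages can be reproduced directly from this formula:

```python
# Probability that two documents with MinHash similarity s are flagged as
# duplicates, for b buckets of r hashes: 1 - (1 - s^r)^b.
def match_probability(s, r, b):
    return 1 - (1 - s**r) ** b

for s in (0.75, 0.80, 0.85):
    fw = match_probability(s, r=8, b=14)    # our setup: 112 hashes
    rw = match_probability(s, r=20, b=450)  # RefinedWeb: 9000 hashes
    print(f"s={s}: fineweb={fw:.3f}  refinedweb={rw:.4f}")
# s=0.80 gives ~0.92 and s=0.85 gives ~0.988 for our setup, matching the
# percentages quoted above.
```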
235
  <p>While the high number of hash functions in RefinedWeb
236
  allows for a steeper, more well-defined cutoff, we believe the compute and storage savings are a reasonable
@@ -250,46 +350,44 @@
250
  trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
251
  tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
252
  green curve below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
253
- <figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img src="plots/dedup_all_dumps_bad.png"/></a></figure>
254
  <p>This was quite puzzling as our intuition regarding web
255
  data was that more deduplication would always result in improved performance. We decided to take a closer
256
  look at one of the oldest dumps, dump 2013-48:</p>
257
- <ul class="bulleted-list">
258
- <li style="list-style-type:disc">pre deduplication, this dump had ~490 billion tokens</li>
259
  </ul>
260
- <ul class="bulleted-list">
261
- <li style="list-style-type:disc">after our iterative MinHash, ~31 billion tokens remained (94% of data
262
  removed)
263
  </li>
264
  </ul>
265
  <p>As an experiment, we tried training two models on 28BT
266
  sampled from the following data from 2013-48:</p>
267
- <ul class="bulleted-list">
268
- <li style="list-style-type:disc">the fully deduplicated remaining ~31 billion tokens (<em>originally kept
269
  data</em>)
270
  </li>
271
  </ul>
272
- <ul class="bulleted-list">
273
- <li style="list-style-type:disc">171 billion tokens obtained by individually deduplicating (without
274
  considering the other dumps) the ~460 billion tokens that had been removed from this dump in the
275
  iterative dedup process (<em>originally removed data</em>)
276
  </li>
277
  </ul>
278
- <figure class="image"><a
279
- href="plots/removed_data_cross_dedup.png"><img src="plots/removed_data_cross_dedup.png"/></a></figure>
280
  <p>These results show that, for this older dump where we were
281
  removing over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
282
- removed (considered independently from all the other dumps).</p>
283
  <h3>Taking a step back: individual dump dedup</h3>
284
  <p>We then tried an alternative approach: we deduplicated
285
  each dump with MinHash individually (without considering the other dumps). This resulted in 20 trillion
286
  tokens of data.</p>
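In datatrove this per-dump MinHash runs as four stages (signatures, buckets, clusters, filter). A compressed sketch in the spirit of the fineweb.py example from the repo; the folder paths are placeholders, and in practice each stage runs as its own executor job:

```python
# Per-dump MinHash dedup stages in datatrove (sketch; placeholder paths).
from datatrove.pipeline.dedup import (
    MinhashDedupSignature, MinhashDedupBuckets, MinhashDedupCluster, MinhashDedupFilter,
)
from datatrove.pipeline.dedup.minhash import MinhashConfig

config = MinhashConfig(num_buckets=14, hashes_per_bucket=8)  # 112 hashes total

signatures = MinhashDedupSignature(output_folder="minhash/sigs", config=config)
buckets = MinhashDedupBuckets(input_folder="minhash/sigs", output_folder="minhash/buckets", config=config)
clusters = MinhashDedupCluster(input_folder="minhash/buckets", output_folder="minhash/remove_ids", config=config)
dedup = MinhashDedupFilter(input_folder="minhash/remove_ids")  # drops flagged docs
```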
287
  <p>When training on a random sample from this dataset we see
288
  that it now matches RefinedWeb’s performance (blue and red curves below):</p>
289
- <figure class="image"><a
290
- href="plots/cross_ind_unfiltered_comparison.png"><img src="plots/cross_ind_unfiltered_comparison.png"/></a>
291
  </figure>
292
- <p>We hypothesis that the main improvement gained from
293
  deduplication is the removal of very large clusters that are present in every single dump (you will find
294
  some examples of these clusters in the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
295
  documents) and that further deduplication of documents with a low number of duplicates (less than ~100, i.e. the number
@@ -302,7 +400,7 @@
302
  improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
303
  lower quality data. We also experimented with applying different, and often “lighter”, deduplication
304
  approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
305
- <h3>A note on measuring the effect of deduplication</h3>
306
  <p>Given the nature of deduplication, its effect is not
307
  always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
308
  filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
@@ -310,32 +408,32 @@
310
  <p>To visualize the effect of scaling the number of training
311
  tokens on measuring deduplication impact, we considered the following (very extreme and unrealistic
312
  regarding the degree of duplication observed) theoretical scenario:</p>
313
- <ul class="bulleted-list">
314
- <li style="list-style-type:disc">there are 100 CommonCrawl dumps (actually roughly true)</li>
315
  </ul>
316
- <ul class="bulleted-list">
317
- <li style="list-style-type:disc">each dump has been perfectly individually deduplicated (every single
318
  document in it is unique)
319
  </li>
320
  </ul>
321
- <ul class="bulleted-list">
322
- <li style="list-style-type:disc">each dump is a perfect copy of the others (maximum possible duplication
323
  across dumps, effectively the worst case scenario)
324
  </li>
325
  </ul>
326
- <ul class="bulleted-list">
327
- <li style="list-style-type:disc">each dump has 200 billion tokens (for a total of 20 trillion, the resulting
328
  size of our individual dedup above)
329
  </li>
330
  </ul>
331
- <ul class="bulleted-list">
332
- <li style="list-style-type:disc">each dump is made up of documents of 1k tokens (200M documents per dump)
333
  </li>
334
  </ul>
335
  <p>We then simulated uniformly sampling documents from this
336
  entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
337
  below you can see how often each document would be repeated.</p>
338
- <figure class="image"><a href="plots/dedup_impact_simulation.png"><img src="plots/dedup_impact_simulation.png"/></a></figure>
339
  <p>For 1B almost all documents would be unique
340
  (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
341
  dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
@@ -347,26 +445,26 @@
347
  documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
348
  measuring deduplication impact on the training of LLMs, once the biggest document clusters have been
349
  removed.</p>
350
- <h3>Other (failed) approaches</h3>
351
  <p>We attempted to improve the performance of the
352
  independently MinHash-deduped 20T of data by further deduplicating it with the following methods:</p>
353
- <ul class="bulleted-list">
354
- <li style="list-style-type:disc">URL deduplication, where we only kept one document per normalized
355
  (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
356
  </ul>
357
- <ul class="bulleted-list">
358
- <li style="list-style-type:disc">Line deduplication:
359
- <ul class="bulleted-list">
360
- <li style="list-style-type:circle">remove all but 1 occurrence of each duplicated line (77.8% of
361
  tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li>
362
  </ul>
363
- <ul class="bulleted-list">
364
- <li style="list-style-type:circle">same as above, but only removing duplicate lines with at least 10
365
  words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens
366
  dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li>
367
  </ul>
368
- <ul class="bulleted-list">
369
- <li style="list-style-type:circle">remove all but 1 occurrence of each span of 3 duplicated lines
370
  with all numbers replaced by 0 (80.9% of tokens removed, 3.7T left) — <em>FineWeb 3-line
371
  dedup</em></li>
372
  </ul>
@@ -375,8 +473,8 @@
375
  <p>The performance of the models trained on each of these was
376
  consistently worse (even if to different degrees) than that of the original independently deduplicated
377
  data:</p>
378
- <figure class="image"><a href="plots/Untitled.png"><img src="plots/Untitled.png"/></a></figure>
379
- <h2>Additional filtering</h2>
380
  <p>By this point we had reached the same performance as
381
  RefinedWeb, but on our aggregate of tasks, another heavily filtered dataset, <a
382
  href="https://arxiv.org/abs/1910.10683">the C4 dataset</a>, still showed stronger performance (with
@@ -384,7 +482,7 @@
384
  <p>We therefore set out to find new filtering steps that
385
  would, at first, allow us to match the performance of C4 and eventually surpass it. A natural starting point
386
  was to look into the processing of C4 itself.</p>
387
- <h3>C4: A dataset that has stood the test of time</h3>
388
  <p>The <a href="https://huggingface.co/datasets/c4">C4
389
  dataset</a> was first released in 2019. It was obtained from the <code>2019-18</code> CommonCrawl dump by
390
  removing non-English data, applying some heuristic filters on both the line and document level,
@@ -396,38 +494,38 @@
396
  <a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented applying
397
  each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump
398
  (plot smoothed with a 3 checkpoints sliding window):</p>
399
- <figure class="image"><a href="plots/c4_filters.png"><img src="plots/c4_filters.png"/></a></figure>
400
- <ul class="bulleted-list">
401
- <li style="list-style-type:disc">applying “All filters” (drop lines not ending on punctuation marks,
402
  mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
403
  ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (purple versus
404
  pink curves).
405
  </li>
406
  </ul>
407
- <ul class="bulleted-list">
408
- <li style="list-style-type:disc">The curly bracket filter, and the word lengths filter only give a small
409
  boost, removing 2.8% and 4.3% of tokens, respectively
410
  </li>
411
  </ul>
412
- <ul class="bulleted-list">
413
- <li style="list-style-type:disc">The terminal punctuation filter, by itself, gives the biggest individual
414
  boost, but removes <em>around 30%</em> of all tokens (!)
415
  </li>
416
  </ul>
417
- <ul class="bulleted-list">
418
- <li style="list-style-type:disc">The lorem_ipsum, javascript and policy rules each remove &lt;0.5% of
419
  training tokens, so we did not train on them individually
420
  </li>
421
  </ul>
422
- <ul class="bulleted-list">
423
- <li style="list-style-type:disc">All filters except the very destructive terminal_punct perform better than
424
  terminal_punct by itself, while removing less in total (~7%)
425
  </li>
426
  </ul>
427
  <p>We decided to apply all C4 filters mentioned above except
428
  the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
429
  the next section.</p>
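A sketch of the retained C4-style rules, with the terminal-punctuation line filter deliberately left out; the policy phrases follow the original C4 code, while the word-count thresholds here are only illustrative:

```python
# C4-style filtering sketch (terminal punctuation rule omitted, as in the
# text above). Returns the filtered document, or None to drop it.
POLICY_PHRASES = ("terms of use", "privacy policy", "cookie policy",
                  "uses cookies", "use of cookies", "use cookies")

def c4_like_filter(text, min_words=50, max_words=10_000):
    kept = []
    for line in text.splitlines():
        low = line.lower()
        if "javascript" in low:                    # javascript rule
            continue
        if any(p in low for p in POLICY_PHRASES):  # cookie/policy-notice rule
            continue
        kept.append(line)
    doc = "\n".join(kept)
    if "lorem ipsum" in doc.lower() or "{" in doc:      # document-level rules
        return None
    if not min_words <= len(doc.split()) <= max_words:  # length thresholds (illustrative)
        return None
    return doc
```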
430
- <h3>A statistical approach to develop heuristic filters</h3>
431
  <p>To come up with new possible filtering rules, we collected
432
  a very large list of statistics (statistical metrics) — over <strong>50</strong> — from different reference
433
  datasets (C4, RefinedWeb, etc) and from a select list of our processed dumps, on both the independently
@@ -444,73 +542,72 @@
444
  caused by lower quality data on the full dedup version, we inspected histograms and manually defined
445
  thresholds for the metrics where these differences were starker. This process yielded 17 candidate
446
  threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
447
- <figure class="image"><a href="plots/Untitled%201.png"><img src="plots/Untitled%201.png"/></a></figure>
448
 
449
  <p>To assess the effectiveness of these newly created
450
  filters, we conducted <strong>28B token</strong> ablation runs on the <strong>2019-18 crawl</strong>. Out
451
  of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
452
  the most significant improvements on the aggregate score:</p>
453
- <ul class="bulleted-list">
454
- <li style="list-style-type:disc">Remove documents where the fraction of lines ending with punctuation ≤ 0.12
455
  (10.14% of tokens removed) — vs the 30% from the original C4 terminal punct filter
456
  </li>
457
  </ul>
458
- <ul class="bulleted-list">
459
- <li style="list-style-type:disc">Remove documents where the fraction of characters in duplicated lines ≥ 0.1
460
  (12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2
461
  </li>
462
  </ul>
463
- <ul class="bulleted-list">
464
- <li style="list-style-type:disc">Remove documents where the fraction of lines shorter than 30 characters ≥
465
  0.67 (3.73% of tokens removed)
466
  </li>
467
  </ul>
468
- <ul class="bulleted-list">
469
- <li style="list-style-type:disc">When applying the 3 together, ~22% of tokens were removed</li>
470
  </ul>
471
- <figure class="image"><a href="plots/Untitled%202.png"><img src="plots/Untitled%202.png"/></a></figure>
472
- <hr />
473
- <h1>The final dataset</h1>
474
  <p>The final FineWeb dataset comprises 15T tokens and
475
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
476
  of benchmark tasks:</p>
477
- <ul class="bulleted-list">
478
- <li style="list-style-type:disc">base filtering</li>
479
  </ul>
480
- <ul class="bulleted-list">
481
- <li style="list-style-type:disc">independent MinHash deduplication per dump</li>
482
  </ul>
483
- <ul class="bulleted-list">
484
- <li style="list-style-type:disc">a selection of C4 filters</li>
485
  </ul>
486
- <ul class="bulleted-list">
487
- <li style="list-style-type:disc">our custom filters (mentioned in the previous section)</li>
488
  </ul>
489
- <figure class="image"><a href="plots/fineweb_all_filters.png"><img src="plots/fineweb_all_filters.png"/></a></figure>
490
  <p>We compared 🍷 FineWeb with the following datasets:</p>
491
- <ul class="bulleted-list">
492
- <li style="list-style-type:disc"><a
493
  href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a>
494
  </li>
495
  </ul>
496
- <ul class="bulleted-list">
497
- <li style="list-style-type:disc"><a href="https://huggingface.co/datasets/allenai/c4">C4</a></li>
498
  </ul>
499
- <ul class="bulleted-list">
500
- <li style="list-style-type:disc"><a href="https://huggingface.co/datasets/allenai/dolma">Dolma v1.6</a> (the
501
  CommonCrawl part)
502
  </li>
503
  </ul>
504
- <ul class="bulleted-list">
505
- <li style="list-style-type:disc"><a href="https://huggingface.co/datasets/EleutherAI/pile">The Pile</a></li>
506
  </ul>
507
- <ul class="bulleted-list">
508
- <li style="list-style-type:disc"><a
509
  href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a>
510
  </li>
511
  </ul>
512
- <ul class="bulleted-list">
513
- <li style="list-style-type:disc"><a
514
  href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a>
515
  (deduplicated)
516
  </li>
@@ -520,13 +617,12 @@
520
  collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
521
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
522
  results here</a>.</p>
523
- <figure class="image"><a href="plots/fineweb_ablations.png"><img src="plots/fineweb_ablations.png"/></a></figure>
524
  <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
525
  FineWeb:</p>
526
- <figure class="image"><a href="plots/Untitled%203.png"><img src="plots/Untitled%203.png"/></a></figure>
527
- <hr />
528
- <h1>Just like fine wine, not all crawls are created
529
- equal</h1>
530
  <p>During our ablation runs, we observed that certain crawls
531
  outperformed others by a significant margin. To investigate this phenomenon, we conducted 27B token runs for
532
  each dump (we used the version with base filtering + ind dedup), with 2 trainings per dump, where each used
@@ -534,24 +630,24 @@
534
  the last 3 checkpoints for both seeds and plotted the average of these 6 data points per dump. </p>
535
  <p>The plot below clearly shows that some dumps perform far
536
  worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
537
- <figure class="image"><a href="plots/score_by_dump.png"><img src="plots/score_by_dump.png"/></a></figure>
538
  <p>We identified 5 main relevant time intervals:</p>
539
- <ul class="bulleted-list">
540
- <li style="list-style-type:disc">2013 to 2016: relatively stable, average quality</li>
541
  </ul>
542
- <ul class="bulleted-list">
543
- <li style="list-style-type:disc">2017 to 2018: high quality, with a drop by the end of 2018</li>
544
  </ul>
545
- <ul class="bulleted-list">
546
- <li style="list-style-type:disc">2019 to 2021: high quality, steady increase</li>
547
  </ul>
548
- <ul class="bulleted-list">
549
- <li style="list-style-type:disc">2021-49 and 2022: very large drop in performance, followed by worse quality
550
  dumps
551
  </li>
552
  </ul>
553
- <ul class="bulleted-list">
554
- <li style="list-style-type:disc">2023 and 2024-10: almost exponential improvement. In particular, 2023-50
555
  and 2024-10 are by far the best dumps
556
  </li>
557
  </ul>
@@ -559,14 +655,14 @@
559
  models on &lt; 15T would be to train on FineWeb while excluding the worst quality CommonCrawl dumps.</p>
560
  <p>We conducted further analysis to investigate the factors
561
  causing these differences from dump to dump. In particular, we considered 3 potential causes: </p>
562
- <ul class="bulleted-list">
563
- <li style="list-style-type:disc">large sudden changes in the list of crawled URLs;</li>
564
  </ul>
565
- <ul class="bulleted-list">
566
- <li style="list-style-type:disc">synthetic (LLM generated) data;</li>
567
  </ul>
568
- <ul class="bulleted-list">
569
- <li style="list-style-type:disc">benchmark contamination;</li>
570
  </ul>
571
  <p>We go over each one in the following sections.</p>
572
  <h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3>
@@ -576,7 +672,7 @@
576
  crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding
577
  it, while a small value means that a considerable number of top 60k FQDNs were downsampled or removed, or
578
  that alternatively new FQDNs were added to the top 60k.</p>
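A sketch of this overlap metric; the per-dump FQDN counters are a hypothetical input that would be built from each crawl's URL list:

```python
# FQDN-overlap sketch: share of a crawl's top-60k fully qualified domain
# names already present in the previous crawl's top-60k.
from collections import Counter
from urllib.parse import urlparse

def fqdn(url):
    return urlparse(url).netloc.lower()  # helper to build the counters

def top_fqdns(fqdn_counts, k=60_000):
    return {name for name, _ in fqdn_counts.most_common(k)}

def overlap(prev_counts, curr_counts, k=60_000):
    prev, curr = top_fqdns(prev_counts, k), top_fqdns(curr_counts, k)
    return len(prev & curr) / len(curr)

a = Counter({"example.com": 120, "blog.example.org": 80, "news.site": 40})
b = Counter({"example.com": 150, "new-domain.net": 90, "news.site": 30})
print(overlap(a, b, k=3))  # 2 of 3 top FQDNs shared -> ~0.67
```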
579
- <figure class="image"><a href="plots/Untitled%204.png"><img src="plots/Untitled%204.png"/></a></figure>
580
  <p>The data indicates three significant changes:
581
  2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
582
  <p>The explanation for the changes between 2022-33/2022-40
@@ -608,7 +704,7 @@
608
  not contain any of these phrases), but assuming that the amount of synthetic data did not change across
609
  dumps, one would expect these frequencies to remain approximately constant over time.</p>
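An illustrative version of this proxy metric; the phrase list below is a stand-in, since the exact list used in the analysis is not reproduced here:

```python
# Illustrative proxy: fraction of documents in a dump containing tell-tale
# assistant phrases (stand-in phrase list, not the one used in the analysis).
PHRASES = ("as a large language model", "as an ai language model")

def proxy_frequency(documents):
    hits = sum(any(p in doc.lower() for p in PHRASES) for doc in documents)
    return hits / max(len(documents), 1)

dump = ["As an AI language model, I cannot browse the web.", "ordinary page text"]
print(proxy_frequency(dump))  # -> 0.5
```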
610
  <p>The results are shown in the following graph:</p>
611
- <figure class="image"><a href="plots/Untitled%205.png"><img src="plots/Untitled%205.png"/></a></figure>
612
  <p>While the frequency remained approximately constant until
613
  2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
614
  in recent crawls, but the proxy metric also correlates well with the agg score, with a Pearson correlation of
@@ -623,9 +719,8 @@
623
  evaluations, might have increased the contamination in recent benchmarks, explaining the score improvements
624
  of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
625
 
626
- <figure class="image"><a href="plots/Untitled%206.png"><img src="plots/Untitled%206.png"/></a></figure>
627
- <hr />
628
- <h1>Next steps</h1>
629
  <p>We want to continue improving FineWeb and will also
630
  release a technical report with more details soon.</p>
631
  <p>Adapting the FineWeb recipe [wip]</p>
@@ -640,4 +735,40 @@
640
 
641
  <d-bibliography src="bibliography.bib"></d-bibliography>
642
  </d-appendix>
 
643
  </body>
 
5
  <meta name="viewport" content="width=device-width, initial-scale=1">
6
  <meta charset="utf8">
7
  <title>FineWeb: 15T tokens of high quality web data</title>
8
+ <style>
9
+
10
+ /* ****************************************
11
+ * TOC
12
+ ******************************************/
13
+ @media (max-width: 1199px) {
14
+ d-contents {
15
+ display: none;
16
+ justify-self: start;
17
+ align-self: start;
18
+ padding-bottom: 0.5em;
19
+ margin-bottom: 1em;
20
+ padding-left: 0.25em;
21
+ border-bottom: 1px solid rgba(0, 0, 0, 0.1);
22
+ border-bottom-width: 1px;
23
+ border-bottom-style: solid;
24
+ border-bottom-color: rgba(0, 0, 0, 0.1);
25
+ }
26
+ }
27
+
28
+ d-contents a:hover {
29
+ border-bottom: none;
30
+ }
31
+
32
+
33
+ @media (min-width: 1200px) {
34
+ d-article {
35
+ /* Ensure d-article does not prevent sticky positioning */
36
+ overflow: visible;
37
+ }
38
+
39
+ d-contents {
40
+ align-self: start;
41
+ grid-column-start: 1 !important;
42
+ grid-column-end: 4 !important;
43
+ grid-row: auto / span 6;
44
+ justify-self: end;
45
+ margin-top: 0em;
46
+ padding-right: 3em;
47
+ padding-left: 2em;
48
+ border-right: 1px solid rgba(0, 0, 0, 0.1);
49
+ border-right-width: 1px;
50
+ border-right-style: solid;
51
+ border-right-color: rgba(0, 0, 0, 0.1);
52
+ position: -webkit-sticky; /* For Safari */
53
+ position: sticky;
54
+ top: 0; /* Adjust this value if needed */
55
+ }
56
+ }
57
+
58
+ d-contents nav h3 {
59
+ margin-top: 0;
60
+ margin-bottom: 1em;
61
+ }
62
+
63
+ d-contents nav div {
64
+ color: rgba(0, 0, 0, 0.8);
65
+ font-weight: bold;
66
+ }
67
+
68
+ d-contents nav a {
69
+ color: rgba(0, 0, 0, 0.8);
70
+ border-bottom: none;
71
+ text-decoration: none;
72
+ }
73
+
74
+ d-contents li {
75
+ list-style-type: none;
76
+ }
77
+
78
+ d-contents ul, d-article d-contents ul {
79
+ padding-left: 1em;
80
+ }
81
+
82
+ d-contents nav ul li {
83
+ margin-bottom: .25em;
84
+ }
85
+
86
+ d-contents nav a:hover {
87
+ text-decoration: underline solid rgba(0, 0, 0, 0.6);
88
+ }
89
+
90
+ d-contents nav ul {
91
+ margin-top: 0;
92
+ margin-bottom: 6px;
93
+ }
94
+
95
+
96
+ d-contents nav > div {
97
+ display: block;
98
+ outline: none;
99
+ margin-bottom: 0.5em;
100
+ }
101
+
102
+ d-contents nav > div > a {
103
+ font-size: 13px;
104
+ font-weight: 600;
105
+ }
106
+
107
+ d-contents nav > div > a:hover,
108
+ d-contents nav > ul > li > a:hover {
109
+ text-decoration: none;
110
+ }
111
+
112
+ </style>
113
  </head>
114
 
115
  <body>
 
149
  <figure style="grid-column: page; mix-blend-mode: multiply;">
150
  <img src="banner.png" alt="FineWeb">
151
  </figure>
152
  </d-title>
153
  <d-byline></d-byline>
154
  <d-article>
155
+ <d-contents>
156
+ </d-contents>
157
+
158
  <p>We have recently released 🍷 FineWeb, our new large-scale
159
  (15T tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
160
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
 
164
  <p><strong>TLDR:</strong> This blog covers the FineWeb
165
  recipe, why more deduplication is not always better, and some interesting findings on the difference in
166
  quality of CommonCrawl dumps.</p>
167
+
168
+ <h2>Preamble</h2>
169
+ <h3>Sourcing the data</h3>
170
  <p>A common question we see asked regarding web datasets used
171
  to train LLMs is “where do they even get all that data?” There are generally two options:</p>
172
+ <ul>
173
+ <li>you either crawl it yourself, like <a
174
  href="https://platform.openai.com/docs/gptbot">OpenAI</a> or <a
175
  href="https://darkvisitors.com/agents/claudebot">Anthropic</a> seem to do
176
  </li>
177
  </ul>
178
+ <ul>
179
+ <li>you use a public repository of crawled webpages, like the one maintained by
180
  the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
181
  </ul>
182
  <p>For FineWeb, similarly to what was done for a large number
 
186
  <p>As an example, their latest crawl (2024-10) contains 3.16
187
  billion web pages, totaling 424.7 TiB of uncompressed content (the size changes from dump to dump). There
188
  are 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format. </p>
189
+ <h3>Processing at scale</h3>
190
  <p>Given the sheer size of the data involved, one of the main
191
  challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
192
  on our processing decisions and easily try out new ideas, while appropriately parallelizing our workloads
 
194
  <p>For this purpose, we developed <a
195
  href="https://github.com/huggingface/datatrove"><code>datatrove</code></a>, an open-source data
196
  processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
197
+ CPU cores. All the data processing steps involved in the creation of FineWeb used this <a
198
  href="https://github.com/huggingface/datatrove">library</a>.</p>
199
+ <h3>What is clean, good data?</h3>
200
  <p>This is probably the main question to keep in mind when
201
  creating a dataset. A good first lesson is that data that would intuitively be considered high quality by a
202
  human may not be necessarily the best data (or at least not all that you need) to train a good model on.</p>
 
232
  href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We tried selecting
233
  benchmarks that would provide good signal at a relatively small scale (small models trained on only a few
234
  billion tokens). Furthermore, we also used the following criteria when selecting benchmarks:</p>
235
+ <ul>
236
+ <li>small variance between runs trained on different samplings of the same
237
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
238
  resulting scores to have as little noise as possible
239
  </li>
240
  </ul>
241
+ <ul>
242
+ <li>performance increasing monotonically (or close) over a training run:
243
  ideally, as the number of seen tokens increases, the performance on this benchmark should not decrease
244
  (should not be too noisy)
245
  </li>
 
248
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>. To
249
  get results quickly, we capped longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
250
  min on a single node of 8 GPUs, done in parallel with the training).</p>
251
+ <h2>The FineWeb recipe</h2>
 
252
  <p>In the next subsections we will explain each of the steps
253
  taken to produce the FineWeb dataset. You can find a full reproducible <code>datatrove</code> config <a
254
  href="https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py">here</a>.</p>
255
+ <figure class="l-body">
256
  <img src="plots/fineweb-recipe.png"/>
257
  </figure>
258
+ <h3>Starting point: text extraction</h3>
259
  <p>CommonCrawl data is available in two main formats: WARC
260
  and WET. <strong>WARC </strong>(Web ARChive format) files contain the raw data from the crawl, including the
261
  full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
 
274
  resulting dataset is considerably larger for the WET data (around 254BT), it proves to be of much worse
275
  quality than the one that used trafilatura to extract text from WARC files (which is around 200BT). Many of
276
  these additional tokens on the WET files are unnecessary page boilerplate.</p>
277
+ <figure><img src="plots/wet_comparison.png"/></figure>
278
 
279
+ <h3>Base filtering</h3>
280
  <p>Filtering is an important part of the curation process. It
281
  removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
282
  deemed to be “lower quality”.</p>
283
  <p>As a basis for our filtering we used part of the setup
284
  from <a href="https://arxiv.org/abs/2306.01116">RefinedWeb</a>. Namely, we:</p>
285
+ <ul>
286
+ <li>Applied URL filtering using a <a
287
  href="https://dsi.ut-capitole.fr/blacklists/">blocklist</a> to remove adult content
288
  </li>
289
  </ul>
290
+ <ul>
291
+ <li>Applied a <a
292
  href="https://fasttext.cc/docs/en/language-identification.html">fastText language classifier</a> to
293
  keep only English text with a score ≥ 0.65
294
  </li>
295
  </ul>
296
+ <ul>
297
+ <li>Applied quality and repetition filters from the <a
298
  href="https://arxiv.org/abs/2112.11446">Gopher</a> paper (using the default thresholds)
299
  </li>
300
  </ul>
301
  <p>After applying this filtering to each of the text
302
  extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when
303
  tokenized with the <code>gpt2</code> tokenizer).</p>
304
+ <h3>Deduplication</h3>
305
  <p>Deduplication is another key step, especially for web
306
  datasets. Deduplication methods attempt to remove redundant/repeated data; this is one of
307
  the most important steps when creating large web datasets for LLMs.</p>
308
+ <h4>Why deduplicate?</h4>
309
  <p>The web has many aggregators, mirrors, templated pages or
310
  just otherwise repeated content spread over different domains and webpages. Often, these duplicated pages
311
  can be introduced by the crawler itself, when different links point to the same page. </p>
 
330
  92% and 98.8% respectively (<code>1-(1-s^8)^14</code>). See the plot below for a match probability
331
  comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
332
  buckets of 20 hashes (that requires a substantially larger amount of compute resources):</p>
333
+ <figure><img src="plots/minhash_parameters_comparison.png"/>
 
334
  </figure>
335
  <p>While the high number of hash functions in RefinedWeb
336
  allows for a steeper, more well-defined cutoff, we believe the compute and storage savings are a reasonable
 
350
  trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
351
  tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
352
  green curve below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
353
+ <figure><img src="plots/dedup_all_dumps_bad.png"/></figure>
354
  <p>This was quite puzzling as our intuition regarding web
355
  data was that more deduplication would always result in improved performance. We decided to take a closer
356
  look at one of the oldest dumps, dump 2013-48:</p>
357
+ <ul>
358
+ <li>pre deduplication, this dump had ~490 billion tokens</li>
359
  </ul>
360
+ <ul>
361
+ <li>after our iterative MinHash, ~31 billion tokens remained (94% of data
362
  removed)
363
  </li>
364
  </ul>
365
  <p>As an experiment, we tried training two models on 28BT
366
  sampled from the following data from 2013-48:</p>
367
+ <ul>
368
+ <li>the fully deduplicated remaining ~31 billion tokens (<em>originally kept
369
  data</em>)
370
  </li>
371
  </ul>
372
+ <ul>
373
+ <li>171 billion tokens obtained by individually deduplicating (without
374
  considering the other dumps) the ~460 billion tokens that had been removed from this dump in the
375
  iterative dedup process (<em>originally removed data</em>)
376
  </li>
377
  </ul>
378
+ <figure><img src="plots/removed_data_cross_dedup.png"/></figure>
 
379
  <p>These results show that, for this older dump where we were
380
  removing over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
381
+ removed (considered independently of all the other dumps).</p>
382
  <h3>Taking a step back: individual dump dedup</h3>
383
  <p>We then tried an alternative approach: we deduplicated
384
  each dump with MinHash individually (without considering the other dumps). This resulted in 20 trillion
385
  tokens of data.</p>
386
  <p>When training on a random sample from this dataset we see
387
  that it now matches RefinedWeb’s performance (blue and red curves below):</p>
388
+ <figure><img src="plots/cross_ind_unfiltered_comparison.png"/>
 
389
  </figure>
390
+ <p>We hypothesize that the main improvement gained from
391
  deduplication is the removal of very large clusters that are present in every single dump (you will find
392
  some examples of these clusters in the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
393
  documents) and that further deduplication of documents with a low number of duplicates (less than ~100, i.e. the number
 
400
  improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
401
  lower quality data. We also experimented with applying different, and often “lighter”, deduplication
402
  approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
403
+ <h4>A note on measuring the effect of deduplication</h4>
404
  <p>Given the nature of deduplication, its effect is not
405
  always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
406
  filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
 
408
  <p>To visualize the effect of scaling the number of training
409
  tokens on measuring deduplication impact, we considered the following (very extreme and unrealistic
410
  regarding the degree of duplication observed) theoretical scenario:</p>
411
+ <ul>
412
+ <li>there are 100 CommonCrawl dumps (actually roughly true)</li>
413
  </ul>
414
+ <ul>
415
+ <li>each dump has been perfectly individually deduplicated (every single
416
  document in it is unique)
417
  </li>
418
  </ul>
419
+ <ul>
420
+ <li>each dump is a perfect copy of the others (maximum possible duplication
421
  across dumps, effectively the worst case scenario)
422
  </li>
423
  </ul>
424
+ <ul>
425
+ <li>each dump has 200 billion tokens (for a total of 20 trillion, the resulting
426
  size of our individual dedup above)
427
  </li>
428
  </ul>
429
+ <ul>
430
+ <li>each dump is made up of documents of 1k tokens (200M documents per dump)
431
  </li>
432
  </ul>
433
  <p>We then simulated uniformly sampling documents from this
434
  entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
435
  below you can see how often each document would be repeated.</p>
436
+ <figure><img src="plots/dedup_impact_simulation.png"/></figure>
437
  <p>For 1B almost all documents would be unique
438
  (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
439
  dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
 
445
  documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
446
  measuring deduplication impact on the training of LLMs, once the biggest document clusters have been
447
  removed.</p>
448
+ <h4>Other (failed) approaches</h4>
449
  <p>We attempted to improve the performance of the
450
  independently MinHash-deduped 20T of data by further deduplicating it with the following methods:</p>
451
+ <ul>
452
+ <li>URL deduplication, where we only kept one document per normalized
453
  (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
454
  </ul>
455
+ <ul>
456
+ <li>Line deduplication:
457
+ <ul>
458
+ <li>remove all but 1 occurrence of each duplicated line (77.8% of
459
  tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li>
460
  </ul>
461
+ <ul>
462
+ <li>same as above, but only removing duplicate lines with at least 10
463
  words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens
464
  dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li>
465
  </ul>
466
+ <ul>
467
+ <li>remove all but 1 occurrence of each span of 3 duplicated lines
468
  with all numbers replaced by 0 (80.9% of tokens removed, 3.7T left) — <em>FineWeb 3-line
469
  dedup</em></li>
470
  </ul>
 
473
  <p>The performance of the models trained on each of these was
474
  consistently worse (even if to different degrees) than that of the original independently deduplicated
475
  data:</p>
476
+ <figure><img src="plots/Untitled.png"/></figure>
477
+ <h3>Additional filtering</h3>
478
  <p>By this point we had reached the same performance as
479
  RefinedWeb, but on our aggregate of tasks, another heavily filtered dataset, <a
480
  href="https://arxiv.org/abs/1910.10683">the C4 dataset</a>, still showed stronger performance (with
 
482
  <p>We therefore set out to find new filtering steps that
483
  would, at first, allow us to match the performance of C4 and eventually surpass it. A natural starting point
484
  was to look into the processing of C4 itself.</p>
485
+ <h4>C4: A dataset that has stood the test of time</h4>
486
  <p>The <a href="https://huggingface.co/datasets/c4">C4
487
  dataset</a> was first released in 2019. It was obtained from the <code>2019-18</code> CommonCrawl dump by
488
  removing non-English data, applying some heuristic filters on both the line and document level,
 
494
  <a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented applying
495
  each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump
496
  (plot smoothed with a 3 checkpoints sliding window):</p>
497
+ <figure><img src="plots/c4_filters.png"/></figure>
498
+ <ul>
499
+ <li>applying “All filters” (drop lines not ending on punctuation marks,
500
  mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
501
  ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (purple versus
502
  pink curves).
503
  </li>
504
  </ul>
505
+ <ul>
506
+ <li>The curly bracket filter, and the word lengths filter only give a small
507
  boost, removing 2.8% and 4.3% of tokens, respectively
508
  </li>
509
  </ul>
510
+ <ul>
511
+ <li>The terminal punctuation filter, by itself, gives the biggest individual
512
  boost, but removes <em>around 30%</em> of all tokens (!)
513
  </li>
514
  </ul>
515
+ <ul>
516
+ <li>The lorem_ipsum, javascript and policy rules each remove &lt;0.5% of
517
  training tokens, so we did not train on them individually
518
  </li>
519
  </ul>
520
+ <ul>
521
+ <li>All filters except the very destructive terminal_punct perform better than
522
  terminal_punct by itself, while removing less in total (~7%)
523
  </li>
524
  </ul>
525
  <p>We decided to apply all C4 filters mentioned above except
526
  the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
527
  the next section.</p>
528
+ <h4>A statistical approach to develop heuristic filters</h4>
529
  <p>To come up with new possible filtering rules, we collected
530
  a very large list of statistics (statistical metrics) — over <strong>50</strong> — from different reference
531
  datasets (C4, RefinedWeb, etc) and from a select list of our processed dumps, on both the independently
 
542
  caused by lower quality data on the full dedup version, we inspected histograms and manually defined
543
  thresholds for the metrics where these differences were starker. This process yielded 17 candidate
544
  threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
545
+ <figure><img src="plots/Untitled%201.png"/></figure>
546
 
547
  <p>To assess the effectiveness of these newly created
548
  filters, we conducted <strong>28B token</strong> ablation runs on the <strong>2019-18 crawl</strong>. Out
549
  of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
550
  the most significant improvements on the aggregate score:</p>
551
+ <ul>
552
+ <li>Remove documents where the fraction of lines ending with punctuation ≤ 0.12
553
  (10.14% of tokens removed) — vs the 30% from the original C4 terminal punct filter
554
  </li>
555
  </ul>
556
+ <ul>
557
+ <li>Remove documents where the fraction of characters in duplicated lines ≥ 0.1
558
  (12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2
559
  </li>
560
  </ul>
561
+ <ul>
562
+ <li>Remove documents where the fraction of lines shorter than 30 characters ≥
563
  0.67 (3.73% of tokens removed)
564
  </li>
565
  </ul>
566
+ <ul>
567
+ <li>When applying the 3 together, ~22% of tokens were removed</li>
568
  </ul>
569
+ <figure><img src="plots/Untitled%202.png"/></figure>
570
+ <h2>The final dataset</h2>
 
571
  <p>The final FineWeb dataset comprises 15T tokens and
572
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
573
  of benchmark tasks:</p>
574
+ <ul>
575
+ <li>base filtering</li>
576
  </ul>
577
+ <ul>
578
+ <li>independent MinHash deduplication per dump</li>
579
  </ul>
580
+ <ul>
581
+ <li>a selection of C4 filters</li>
582
  </ul>
583
+ <ul>
584
+ <li>our custom filters (mentioned in the previous section)</li>
585
  </ul>
586
+ <figure><img src="plots/fineweb_all_filters.png"/></figure>
587
  <p>We compared 🍷 FineWeb with the following datasets:</p>
588
+ <ul>
589
+ <li><a
590
  href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a>
591
  </li>
592
  </ul>
593
+ <ul>
594
+ <li><a href="https://huggingface.co/datasets/allenai/c4">C4</a></li>
595
  </ul>
596
+ <ul>
597
+ <li><a href="https://huggingface.co/datasets/allenai/dolma">Dolma v1.6</a> (the
598
  CommonCrawl part)
599
  </li>
600
  </ul>
601
+ <ul>
602
+ <li><a href="https://huggingface.co/datasets/EleutherAI/pile">The Pile</a></li>
603
  </ul>
604
+ <ul>
605
+ <li><a
606
  href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a>
607
  </li>
608
  </ul>
609
+ <ul>
610
+ <li><a
611
  href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a>
612
  (deduplicated)
613
  </li>
 
617
  collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
618
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
619
  results here</a>.</p>
620
+ <figure><img src="plots/fineweb_ablations.png"/></figure>
621
  <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
622
  FineWeb:</p>
623
+ <figure><img src="plots/Untitled%203.png"/></figure>
624
+ <h2>Just like fine wine, not all crawls are created
625
+ equal</h2>
 
626
  <p>During our ablation runs, we observed that certain crawls
627
  outperformed others by a significant margin. To investigate this phenomenon, we conducted 27B token runs for
628
  each dump (we used the version with base filtering + ind dedup), with 2 trainings per dump, where each used
 
630
  the last 3 checkpoints for both seeds and plotted the average of these 6 data points per dump. </p>
631
  <p>The plot below clearly shows that some dumps perform far
632
  worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
633
+ <figure><img src="plots/score_by_dump.png"/></figure>
634
  <p>We identified 5 main relevant time intervals:</p>
635
+ <ul>
636
+ <li>2013 to 2016: relatively stable, average quality</li>
637
  </ul>
638
+ <ul>
639
+ <li>2017 to 2018: high quality, with a drop by the end of 2018</li>
640
  </ul>
641
+ <ul>
642
+ <li>2019 to 2021: high quality, steady increase</li>
643
  </ul>
644
+ <ul>
645
+ <li>2021-49 and 2022: very large drop in performance, followed by worse quality
646
  dumps
647
  </li>
648
  </ul>
649
+ <ul>
650
+ <li>2023 and 2024-10: almost exponential improvement. In particular, 2023-50
651
  and 2024-10 are by far the best dumps
652
  </li>
653
  </ul>
 
655
  models on &lt; 15T would be to train on FineWeb while excluding the worst quality CommonCrawl dumps.</p>
656
  <p>We conducted further analysis to investigate the factors
657
  causing these differences from dump to dump. In particular, we considered 3 potential causes: </p>
658
+ <ul>
659
+ <li>large sudden changes in the list of crawled URLs;</li>
660
  </ul>
661
+ <ul>
662
+ <li>synthetic (LLM generated) data;</li>
663
  </ul>
664
+ <ul>
665
+ <li>benchmark contamination;</li>
666
  </ul>
667
  <p>We go over each one in the following sections.</p>
668
  <h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3>
 
672
  crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding
673
  it, while a small value means that a considerable number of top 60k FQDNs were downsampled or removed, or
674
  that alternatively new FQDNs were added to the top 60k.</p>
675
+ <figure><img src="plots/Untitled%204.png"/></figure>
676
  <p>The data indicates three significant changes:
677
  2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
678
  <p>The explanation for the changes between 2022-33/2022-40
 
704
  not contain any of these phrases), but assuming that the amount of synthetic data did not change across
705
  dumps, one would expect these frequencies to remain approximately constant over time.</p>
706
  <p>The results are shown in the following graph:</p>
707
+ <figure><img src="plots/Untitled%205.png"/></figure>
708
  <p>While the frequency remained approximately constant until
709
  2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
710
  in recent crawls, but the proxy metric also correlates well with the agg score, with a Pearson correlation of
 
719
  evaluations, might have increased the contamination in recent benchmarks, explaining the score improvements
720
  of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
721
 
722
+ <figure><img src="plots/Untitled%206.png"/></figure>
723
+ <h2>Next steps</h2>
 
724
  <p>We want to continue improving FineWeb and will also
725
  release a technical report with more details soon.</p>
726
  <p>Adapting the FineWeb recipe [wip]</p>
 
735
 
736
  <d-bibliography src="bibliography.bib"></d-bibliography>
737
  </d-appendix>
738
+
739
+ <script>
740
+ const article = document.querySelector('d-article');
741
+ const toc = document.querySelector('d-contents');
742
+ if (toc) {
743
+ const headings = article.querySelectorAll('h2, h3');
744
+ let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
745
+
746
+ let elements = [];
747
+ for (const el of headings) {
748
+ // should element be included in TOC?
749
+ const isInTitle = el.parentElement.tagName == 'D-TITLE';
750
+ const isException = el.getAttribute('no-toc');
751
+ if (isInTitle || isException) continue;
752
+ el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
753
+ const link = '<a href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
754
+ if (el.tagName === 'H2')
755
+ elements.push([link, []]);
756
+ else {
757
+ if (elements.length === 0)
758
+ elements.push([null, []])
759
+ elements[elements.length - 1][1].push(link)
760
+ }
761
+ }
762
+
763
+ for (const topLevel of elements) {
764
+ ToC += '<div>' + topLevel[0] + '</div><ul>';
765
+ for (const subLevel of topLevel[1])
766
+ ToC += '<li>' + subLevel + '</li>';
767
+ ToC += '</ul>';
768
+ }
769
+ ToC += '</nav>';
770
+ toc.innerHTML = ToC;
771
+ toc.setAttribute('prerendered', 'true');
772
+ }
773
+ </script>
774
  </body>