added ToC and other changes

Files changed: README.md (+0 −2), index.html (+280 −149)
README.md CHANGED
@@ -7,5 +7,3 @@ sdk: static
 pinned: false
 header: mini
 ---
-
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
index.html CHANGED
@@ -5,6 +5,111 @@
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta charset="utf8">
 <title>FineWeb: 15T tokens of high quality web data</title>
+<style>
+
+    /* ****************************************
+     * TOC
+     ******************************************/
+    @media (max-width: 1199px) {
+        d-contents {
+            display: none;
+            justify-self: start;
+            align-self: start;
+            padding-bottom: 0.5em;
+            margin-bottom: 1em;
+            padding-left: 0.25em;
+            border-bottom: 1px solid rgba(0, 0, 0, 0.1);
+            border-bottom-width: 1px;
+            border-bottom-style: solid;
+            border-bottom-color: rgba(0, 0, 0, 0.1);
+        }
+    }
+
+    d-contents a:hover {
+        border-bottom: none;
+    }
+
+
+    @media (min-width: 1200px) {
+        d-article {
+            /* Ensure d-article does not prevent sticky positioning */
+            overflow: visible;
+        }
+
+        d-contents {
+            align-self: start;
+            grid-column-start: 1 !important;
+            grid-column-end: 4 !important;
+            grid-row: auto / span 6;
+            justify-self: end;
+            margin-top: 0em;
+            padding-right: 3em;
+            padding-left: 2em;
+            border-right: 1px solid rgba(0, 0, 0, 0.1);
+            border-right-width: 1px;
+            border-right-style: solid;
+            border-right-color: rgba(0, 0, 0, 0.1);
+            position: -webkit-sticky; /* For Safari */
+            position: sticky;
+            top: 0; /* Adjust this value if needed */
+        }
+    }
+
+    d-contents nav h3 {
+        margin-top: 0;
+        margin-bottom: 1em;
+    }
+
+    d-contents nav div {
+        color: rgba(0, 0, 0, 0.8);
+        font-weight: bold;
+    }
+
+    d-contents nav a {
+        color: rgba(0, 0, 0, 0.8);
+        border-bottom: none;
+        text-decoration: none;
+    }
+
+    d-contents li {
+        list-style-type: none;
+    }
+
+    d-contents ul, d-article d-contents ul {
+        padding-left: 1em;
+    }
+
+    d-contents nav ul li {
+        margin-bottom: .25em;
+    }
+
+    d-contents nav a:hover {
+        text-decoration: underline solid rgba(0, 0, 0, 0.6);
+    }
+
+    d-contents nav ul {
+        margin-top: 0;
+        margin-bottom: 6px;
+    }
+
+
+    d-contents nav > div {
+        display: block;
+        outline: none;
+        margin-bottom: 0.5em;
+    }
+
+    d-contents nav > div > a {
+        font-size: 13px;
+        font-weight: 600;
+    }
+
+    d-contents nav > div > a:hover,
+    d-contents nav > ul > li > a:hover {
+        text-decoration: none;
+    }
+
+</style>
 </head>
 
 <body>
@@ -44,12 +149,12 @@
 <figure style="grid-column: page; mix-blend-mode: multiply;">
 <img src="banner.png" alt="FineWeb">
 </figure>
-<!-- <figure style="grid-column: page; margin: 1rem 0;"><img src="banner.png"-->
-<!-- style="width:100%; border: 1px solid rgba(0, 0, 0, 0.2);"/>-->
-<!-- </figure>-->
 </d-title>
 <d-byline></d-byline>
 <d-article>
+<d-contents>
+</d-contents>
+
 <p>We have recently released 🍷FineWeb, our new large scale
 (15T tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
@@ -59,19 +164,19 @@
 <p><strong>TLDR:</strong> This blog covers the FineWeb
 recipe, why more deduplication is not always better and some interesting findings on the difference in
 quality of CommonCrawl dumps.</p>
-
-<…
-<…
+
+<h2>Preamble</h2>
+<h3>Sourcing the data</h3>
 <p>A common question we see asked regarding web datasets used
 to train LLMs is “where do they even get all that data?” There are generally two options:</p>
-<ul…
-<li…
+<ul>
+<li>you either crawl it yourself, like <a
 href="https://platform.openai.com/docs/gptbot">OpenAI</a> or <a
 href="https://darkvisitors.com/agents/claudebot">Anthropic</a> seem to do
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>you use a public repository of crawled webpages, like the one maintained by
 the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
 </ul>
 <p>For FineWeb, similarly to what was done for a large number
@@ -81,7 +186,7 @@
 <p>As an example, their latest crawl (2024-10) contains 3.16
 billion web pages, totaling 424.7 TiB of uncompressed content (the size changes from dump to dump). There
 are 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format. </p>
-<…
+<h3>Processing at scale</h3>
 <p>Given the sheer size of the data involved, one of the main
 challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
 on our processing decisions and easily try out new ideas, while appropriately parallelizing our workloads
@@ -89,9 +194,9 @@
 <p>For this purpose, we developed <a
 href="https://github.com/huggingface/datatrove"><code>datatrove</code></a>, an open-source data
 processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
-CPU cores. All…
+CPU cores. All the data processing steps involved in the creation of FineWeb used this <a
 href="https://github.com/huggingface/datatrove">library</a>.</p>
-<…
+<h3>What is clean, good data?</h3>
 <p>This is probably the main question to keep in mind when
 creating a dataset. A good first lesson is that data that would intuitively be considered high quality by a
 human may not be necessarily the best data (or at least not all that you need) to train a good model on.</p>
@@ -127,14 +232,14 @@
 href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We tried selecting
 benchmarks that would provide good signal at a relatively small scale (small models trained on only a few
 billion tokens). Furthermore, we also used the following criteria when selecting benchmarks:</p>
-<ul…
-<li…
+<ul>
+<li>small variance between runs trained on different samplings of the same
 dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
 resulting scores to have as little noise as possible
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>performance increasing monotonically (or close) over a training run:
 ideally, as the number of seen tokens increases, the performance on this benchmark should not decrease
 (should not be too noisy)
 </li>
@@ -143,18 +248,14 @@
 href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>. To
 have results quickly we capped longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
 min on a single node of 8 GPUs - done in parallel to the training).</p>
-<…
-<h1>The FineWeb recipe</h1>
+<h2>The FineWeb recipe</h2>
 <p>In the next subsections we will explain each of the steps
 taken to produce the FineWeb dataset. You can find a full reproducible <code>datatrove</code> config <a
 href="https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py">here</a>.</p>
-<…
-.neighborhood-figure-container {grid-column: screen; width: 100%; margin: auto; margin-top: 30px; margin-bottom: 30px; padding-top: 20px; padding-bottom: 10px; border-bottom: 1px solid #EEE; border-top: 1px solid #EEE;}
-</style>
-<figure class="l-body figure">
+<figure class="l-body">
 <img src="plots/fineweb-recipe.png"/>
 </figure>
-<…
+<h3>Starting point: text extraction</h3>
 <p>CommonCrawl data is available in two main formats: WARC
 and WET. <strong>WARC </strong>(Web ARChive format) files contain the raw data from the crawl, including the
 full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
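The WARC-to-text step discussed in the surrounding hunks is easy to sketch. The following is an illustration only, not the datatrove pipeline; it assumes the published warcio and trafilatura packages and a locally downloaded WARC file:

```python
# Sketch: pull the HTML payload of each response record out of a
# CommonCrawl WARC file and strip boilerplate with trafilatura.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def iter_extracted_texts(warc_path: str):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)  # None when no main content is found
            if text:
                yield text
```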
@@ -173,38 +274,38 @@
 resulting dataset is considerably larger for the WET data (around 254BT), it proves to be of much worse
 quality than the one that used trafilatura to extract text from WARC files (which is around 200BT). Many of
 these additional tokens on the WET files are unnecessary page boilerplate.</p>
-<figure…
+<figure><img src="plots/wet_comparison.png"/></figure>
 
-<…
+<h3>Base filtering</h3>
 <p>Filtering is an important part of the curation process. It
 removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
 deemed to be “lower quality”.</p>
 <p>As a basis for our filtering we used part of the setup
 from <a href="https://arxiv.org/abs/2306.01116">RefinedWeb</a>. Namely, we:</p>
-<ul…
-<li…
+<ul>
+<li>Applied URL filtering using a <a
 href="https://dsi.ut-capitole.fr/blacklists/">blocklist</a> to remove adult content
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>Applied a <a
 href="https://fasttext.cc/docs/en/language-identification.html">fastText language classifier</a> to
 keep only English text with a score ≥ 0.65
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>Applied quality and repetition filters from the <a
 href="https://arxiv.org/abs/2112.11446">Gopher</a> paper (using the default thresholds)
 </li>
 </ul>
 <p>After applying this filtering to each of the text
 extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when
 tokenized with the <code>gpt2</code> tokenizer).</p>
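Of the three base filters added above, the language classification step is the simplest to illustrate. Below is a minimal sketch of a fastText English filter with the 0.65 threshold; it assumes fastText's published lid.176.bin language-ID model, and the helper name is ours, not datatrove's:

```python
# Sketch of the English-language filter: keep a document only when
# fastText's language identifier predicts English with score >= 0.65.
import fasttext

model = fasttext.load_model("lid.176.bin")  # https://fasttext.cc/docs/en/language-identification.html

def is_english(text: str, threshold: float = 0.65) -> bool:
    labels, scores = model.predict(text.replace("\n", " "))  # fastText expects a single line
    return labels[0] == "__label__en" and scores[0] >= threshold
```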
-<…
+<h3>Deduplication</h3>
 <p>Deduplication is another important step, especially for web
 datasets. Methods to deduplicate datasets attempt to remove redundant/repeated data. Deduplication is one of
 the most important steps when creating large web datasets for LLMs.</p>
-<…
+<h4>Why deduplicate?</h4>
 <p>The web has many aggregators, mirrors, templated pages or
 just otherwise repeated content spread over different domains and webpages. Often, these duplicated pages
 can be introduced by the crawler itself, when different links point to the same page. </p>
@@ -229,8 +330,7 @@
 92% and 98.8% respectively (<code>1-(1-s^8)^14</code>). See the plot below for a match probability
 comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
 buckets of 20 hashes (that requires a substantially larger amount of compute resources):</p>
-<figure…
-href="plots/minhash_parameters_comparison.png"><img src="plots/minhash_parameters_comparison.png"/></a>
+<figure><img src="plots/minhash_parameters_comparison.png"/>
 </figure>
 <p>While the high number of hash functions in RefinedWeb
 allows for a steeper, more well defined cut off, we believe the compute and storage savings are a reasonable
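The match probabilities quoted in this hunk follow directly from the closed form 1-(1-s^r)^b for b buckets of r hashes each. A few lines of Python reproduce the comparison between the 14x8 FineWeb setup and RefinedWeb's 450x20 (this computes the theoretical curve only, not the plotted ablation data):

```python
# Probability that two documents with true MinHash similarity s share
# at least one full bucket, for b buckets of r hashes each.
def match_probability(s: float, r: int, b: int) -> float:
    return 1 - (1 - s**r) ** b

for s in (0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9):
    fineweb = match_probability(s, r=8, b=14)       # 112 hashes in total
    refinedweb = match_probability(s, r=20, b=450)  # 9000 hashes in total
    print(f"s={s:.2f}  fineweb={fineweb:.3f}  refinedweb={refinedweb:.3f}")
```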
@@ -250,46 +350,44 @@
 trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
 tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
 green curve below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
-<figure…
+<figure><img src="plots/dedup_all_dumps_bad.png"/></figure>
 <p>This was quite puzzling as our intuition regarding web
 data was that more deduplication would always result in improved performance. We decided to take a closer
 look at one of the oldest dumps, dump 2013-48:</p>
-<ul…
-<li…
+<ul>
+<li>pre deduplication, this dump had ~490 billion tokens</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>after our iterative MinHash, ~31 billion tokens remained (94% of data
 removed)
 </li>
 </ul>
 <p>As an experiment, we tried training two models on 28BT
 sampled from the following data from 2013-48:</p>
-<ul…
-<li…
+<ul>
+<li>the fully deduplicated remaining ~31 billion tokens (<em>originally kept
 data</em>)
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>171 billion tokens obtained by individually deduplicating (without
 considering the other dumps) the ~460 billion tokens that had been removed from this dump in the
 iterative dedup process (<em>originally removed data</em>)
 </li>
 </ul>
-<figure…
-href="plots/removed_data_cross_dedup.png"><img src="plots/removed_data_cross_dedup.png"/></a></figure>
+<figure><img src="plots/removed_data_cross_dedup.png"/></figure>
 <p>These results show that, for this older dump where we were
 removing over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
-removed (considered independently…
+removed (considered independently of all the other dumps).</p>
 <h3>Taking a step back: individual dump dedup</h3>
 <p>We then tried an alternative approach: we deduplicated
 each dump with MinHash individually (without considering the other dumps). This resulted in 20 trillion
 tokens of data.</p>
 <p>When training on a random sample from this dataset we see
 that it now matches RefinedWeb’s performance (blue and red curves below):</p>
-<figure…
-href="plots/cross_ind_unfiltered_comparison.png"><img src="plots/cross_ind_unfiltered_comparison.png"/></a>
+<figure><img src="plots/cross_ind_unfiltered_comparison.png"/>
 </figure>
-<p>We…
+<p>We hypothesize that the main improvement gained from
 deduplication is the removal of very large clusters that are present in every single dump (you will find
 some examples of these clusters on the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
 documents) and that further deduplication for low number of deduplications (less than ~100 i.e. the number
|
|
302 |
improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
|
303 |
lower quality data. We also experimented with applying different, and often “lighter”, deduplication
|
304 |
approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
|
305 |
-
<
|
306 |
<p>Given the nature of deduplication, its effect is not
|
307 |
always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
|
308 |
filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
|
@@ -310,32 +408,32 @@
|
|
310 |
<p>To visualize the effect of scaling the number of training
|
311 |
tokens on measuring deduplication impact, we considered the following (very extreme and unrealistic
|
312 |
regarding the degree of duplication observed) theoretical scenario:</p>
|
313 |
-
<ul
|
314 |
-
<li
|
315 |
</ul>
|
316 |
-
<ul
|
317 |
-
<li
|
318 |
document in it is unique)
|
319 |
</li>
|
320 |
</ul>
|
321 |
-
<ul
|
322 |
-
<li
|
323 |
across dumps, effectively the worst case scenario)
|
324 |
</li>
|
325 |
</ul>
|
326 |
-
<ul
|
327 |
-
<li
|
328 |
size of our individual dedup above)
|
329 |
</li>
|
330 |
</ul>
|
331 |
-
<ul
|
332 |
-
<li
|
333 |
</li>
|
334 |
</ul>
|
335 |
<p>We then simulated uniformly sampling documents from this
|
336 |
entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
|
337 |
below you can see how often each document would be repeated.</p>
|
338 |
-
<figure
|
339 |
<p>For 1B almost all documents would be unique
|
340 |
(#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
|
341 |
dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
|
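The simulated distribution can also be approximated in closed form: when n documents are drawn uniformly from N unique documents, the copy count of a given document is roughly Poisson with rate n/N, and the share of sampled documents sitting in a cluster of k copies is the size-biased term e^(-λ)λ^(k-1)/(k-1)!. A sketch under the scenario's assumptions (this approximates the simulation rather than re-running it):

```python
# Size-biased Poisson approximation of the duplication simulation:
# probability that a sampled document has exactly k copies in the subset.
from math import exp, factorial

UNIQUE_DOCS = 200_000_000  # 200B tokens per dump / 1k tokens per document

for subset_tokens in (1e9, 1e10, 1e11, 3.5e11, 1e12):
    lam = (subset_tokens / 1_000) / UNIQUE_DOCS  # expected copies of each unique document
    dist = [exp(-lam) * lam ** (k - 1) / factorial(k - 1) for k in range(1, 9)]
    line = " ".join(f"k={k}:{p:.3f}" for k, p in enumerate(dist, start=1))
    print(f"{subset_tokens / 1e9:6.0f}B tokens  {line}")
```

At 1B tokens λ is about 0.005 and essentially every sampled document is unique; at 1T tokens λ=5 and the mass shifts to clusters of several copies, matching the figure.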
@@ -347,26 +445,26 @@
 documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
 measuring deduplication impact on the training of LLMs, once the biggest document clusters have been
 removed.</p>
-<…
+<h4>Other (failed) approaches</h4>
 <p>We attempted to improve the performance of the
 independently minhash deduped 20T of data by further deduplicating it with the following methods:</p>
-<ul…
-<li…
+<ul>
+<li>URL deduplication, where we only kept one document per normalized
 (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
 </ul>
-<ul…
-<li…
-<ul…
-<li…
+<ul>
+<li>Line deduplication:
+<ul>
+<li>remove all but 1 occurrence of each duplicated line (77.8% of
 tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>same as above, but only removing duplicate lines with at least 10
 words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens
 dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>remove all but 1 occurrence of each span of 3 duplicated lines
 with all numbers replaced by 0 (80.9% of tokens removed, 3.7T left) — <em>FineWeb 3-line
 dedup</em></li>
 </ul>
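To make the line-level variants concrete, here is a rough sketch of the simplest one, removing all but the first occurrence of each duplicated line, using an exact global hash set; the real runs were distributed across shards and the function name is illustrative:

```python
# Sketch of "FineWeb line dedup": keep only the first occurrence of
# every line seen anywhere in the corpus.
import hashlib

def dedup_lines(documents):
    seen = set()
    for doc in documents:
        kept = []
        for line in doc.split("\n"):
            digest = hashlib.sha1(line.encode("utf-8")).digest()
            if digest in seen:
                continue  # this exact line already appeared earlier in the corpus
            seen.add(digest)
            kept.append(line)
        yield "\n".join(kept)
```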
@@ -375,8 +473,8 @@
 <p>The performance of the models trained on each of these was
 consistently worse (even if to different degrees) than that of the original independently deduplicated
 data:</p>
-<figure…
-<…
+<figure><img src="plots/Untitled.png"/></figure>
+<h3>Additional filtering</h3>
 <p>By this point we had reached the same performance as
 RefinedWeb, but on our aggregate of tasks, another heavily filtered dataset, <a
 href="https://arxiv.org/abs/1910.10683">the C4 dataset</a>, still showed stronger performance (with
@@ -384,7 +482,7 @@
 <p>We therefore set out to find new filtering steps that
 would, at first, allow us to match the performance of C4 and eventually surpass it. A natural starting point
 was to look into the processing of C4 itself.</p>
-<…
+<h4>C4: A dataset that has stood the test of time</h4>
 <p>The <a href="https://huggingface.co/datasets/c4">C4
 dataset</a> was first released in 2019. It was obtained from the <code>2019-18</code> CommonCrawl dump by
 removing non english data, applying some heuristic filters on both the line and document level,
|
|
396 |
<a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented applying
|
397 |
each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump
|
398 |
(plot smoothed with a 3 checkpoints sliding window):</p>
|
399 |
-
<figure
|
400 |
-
<ul
|
401 |
-
<li
|
402 |
mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
|
403 |
ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (purple versus
|
404 |
pink curves).
|
405 |
</li>
|
406 |
</ul>
|
407 |
-
<ul
|
408 |
-
<li
|
409 |
boost, removing 2.8% and 4.3% of tokens, respectively
|
410 |
</li>
|
411 |
</ul>
|
412 |
-
<ul
|
413 |
-
<li
|
414 |
boost, but removes <em>around 30%</em> of all tokens (!)
|
415 |
</li>
|
416 |
</ul>
|
417 |
-
<ul
|
418 |
-
<li
|
419 |
training tokens, so we did not train on them individually
|
420 |
</li>
|
421 |
</ul>
|
422 |
-
<ul
|
423 |
-
<li
|
424 |
terminal_punct by itself, while removing less in total (~7%)
|
425 |
</li>
|
426 |
</ul>
|
427 |
<p>We decided to apply all C4 filters mentioned above except
|
428 |
the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
|
429 |
the next section.</p>
|
430 |
-
<
|
431 |
<p>To come up with new possible filtering rules, we collected
|
432 |
a very large list of statistics (statistical metrics) — over <strong>50</strong> — from different reference
|
433 |
datasets (C4, RefinedWeb, etc) and from a select list of our processed dumps, on both the independently
|
@@ -444,73 +542,72 @@
|
|
444 |
caused by lower quality data on the full dedup version, we inspected histograms and manually defined
|
445 |
thresholds for the metrics where these differences were starker. This process yielded 17 candidate
|
446 |
threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
|
447 |
-
<figure
|
448 |
|
449 |
<p>To assess the effectiveness of these newly created
|
450 |
filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
|
451 |
of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
|
452 |
the most significant improvements on the aggregate score:</p>
|
453 |
-
<ul
|
454 |
-
<li
|
455 |
(10.14% of tokens removed) — vs the 30% from the original C4 terminal punct filter
|
456 |
</li>
|
457 |
</ul>
|
458 |
-
<ul
|
459 |
-
<li
|
460 |
(12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2
|
461 |
</li>
|
462 |
</ul>
|
463 |
-
<ul
|
464 |
-
<li
|
465 |
0.67 (3.73% of tokens removed)
|
466 |
</li>
|
467 |
</ul>
|
468 |
-
<ul
|
469 |
-
<li
|
470 |
</ul>
|
471 |
-
<figure
|
472 |
-
<
|
473 |
-
<h1>The final dataset</h1>
|
474 |
<p>The final FineWeb dataset comprises 15T tokens and
|
475 |
includes the following previously mentioned steps, in order, each providing a performance boost on our group
|
476 |
of benchmark tasks:</p>
|
477 |
-
<ul
|
478 |
-
<li
|
479 |
</ul>
|
480 |
-
<ul
|
481 |
-
<li
|
482 |
</ul>
|
483 |
-
<ul
|
484 |
-
<li
|
485 |
</ul>
|
486 |
-
<ul
|
487 |
-
<li
|
488 |
</ul>
|
489 |
-
<figure
|
490 |
<p>We compared 🍷 FineWeb with the following datasets:</p>
|
491 |
-
<ul
|
492 |
-
<li
|
493 |
href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a>
|
494 |
</li>
|
495 |
</ul>
|
496 |
-
<ul
|
497 |
-
<li
|
498 |
</ul>
|
499 |
-
<ul
|
500 |
-
<li
|
501 |
CommonCrawl part)
|
502 |
</li>
|
503 |
</ul>
|
504 |
-
<ul
|
505 |
-
<li
|
506 |
</ul>
|
507 |
-
<ul
|
508 |
-
<li
|
509 |
href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a>
|
510 |
</li>
|
511 |
</ul>
|
512 |
-
<ul
|
513 |
-
<li
|
514 |
href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a>
|
515 |
(deduplicated)
|
516 |
</li>
|
@@ -520,13 +617,12 @@
|
|
520 |
collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
|
521 |
href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
|
522 |
results here</a>.</p>
|
523 |
-
<figure
|
524 |
<p>Some histogram comparisons of C4, Dolma, RefinedWeb and
|
525 |
FineWeb:</p>
|
526 |
-
<figure
|
527 |
-
<
|
528 |
-
|
529 |
-
equal</h1>
|
530 |
<p>During our ablation runs, we observed that certain crawls
|
531 |
outperformed others by a significant margin. To investigate this phenomenon, we conducted 27B token runs for
|
532 |
each dump (we used the version with base filtering + ind dedup), with 2 trainings per dump, where each used
|
@@ -534,24 +630,24 @@
 the last 3 checkpoints for both seeds and plotted the average of these 6 data points per dump. </p>
 <p>The plot below clearly shows that some dumps perform far
 worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
-<figure…
+<figure><img src="plots/score_by_dump.png"/></figure>
 <p>We identified 5 main relevant time intervals:</p>
-<ul…
-<li…
+<ul>
+<li>2013 to 2016: relatively stable, average quality</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>2017 to 2018: high quality, with a drop by the end of 2018</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>2019 to 2021: high quality, steadily increasing</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>2021-49 and 2022: very large drop in performance, followed by worse quality
 dumps
 </li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>2023 and 2024-10: almost exponential improvement. In particular, 2023-50
 and 2024-10 are by far the best dumps
 </li>
 </ul>
@@ -559,14 +655,14 @@
 models on < 15T would be to train on FineWeb while excluding the worst quality CommonCrawl dumps.</p>
 <p>We conducted further analysis to investigate the factors
 causing these differences from dump to dump. In particular, we considered 3 potential causes: </p>
-<ul…
-<li…
+<ul>
+<li>large sudden changes in the list of crawled URLs;</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>synthetic (LLM generated) data;</li>
 </ul>
-<ul…
-<li…
+<ul>
+<li>benchmark contamination;</li>
 </ul>
 <p>We go over each one in the following sections.</p>
 <h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3>
@@ -576,7 +672,7 @@
 crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding
 it, while a small value means that a considerable number of top 60k FQDNs were downsampled or removed, or,
 alternatively, that new FQDNs were added to the top 60k.</p>
-<figure…
+<figure><img src="plots/Untitled%204.png"/></figure>
 <p>The data indicates three significant changes:
 2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
 <p>The explanation for the changes between 2022-33/2022-40
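The overlap metric itself is straightforward to compute once per-crawl URL lists are available. A sketch with our own function names; collecting the URL lists from the crawl index is assumed to happen upstream, and URLs are assumed to include a scheme:

```python
# Sketch: fraction of a crawl's top-60k fully qualified domain names
# (FQDNs) that also appear in the preceding crawl's top-60k.
from collections import Counter
from urllib.parse import urlsplit

def top_fqdns(urls, k=60_000):
    counts = Counter(urlsplit(u).hostname for u in urls)
    return {fqdn for fqdn, _ in counts.most_common(k)}

def fqdn_overlap(prev_crawl_urls, curr_crawl_urls, k=60_000):
    return len(top_fqdns(prev_crawl_urls, k) & top_fqdns(curr_crawl_urls, k)) / k
```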
@@ -608,7 +704,7 @@
 not contain any of these phrases), but assuming that the amount of synthetic data were not to change across
 dumps, one would expect these frequencies to remain approximately constant over time.</p>
 <p>The results are shown in the following graph:</p>
-<figure…
+<figure><img src="plots/Untitled%205.png"/></figure>
 <p>While the frequency remained approximately constant until
 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
 in recent crawls, but the proxy metric also correlates well with the agg score, with a Pearson correlation of
|
|
623 |
evaluations, might have increased the contamination in recent benchmarks, explaining the score improvements
|
624 |
of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
|
625 |
|
626 |
-
<figure
|
627 |
-
<
|
628 |
-
<h1>Next steps</h1>
|
629 |
<p>We want to continue improving FineWeb and will also
|
630 |
release a technical report with more details soon.</p>
|
631 |
<p>Adapting the FineWeb recipe [wip]</p>
|
@@ -640,4 +735,40 @@
|
|
640 |
|
641 |
<d-bibliography src="bibliography.bib"></d-bibliography>
|
642 |
</d-appendix>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
643 |
</body>
|