loubnabnl committed on
Commit
5385888
·
1 Parent(s): 7c87bf2
Files changed (1)
  1. index.html +9 -6
index.html CHANGED
@@ -692,17 +692,20 @@
692
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
693
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
694
  <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
695
- <h3>Filtering</h3>
696
- <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. To build 📚 FineWeb-Edu, we filtered out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. Here are the key highlights of the ablation results:</p>
697
  <ul>
698
  <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
699
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma1.7 to match MMLU results.</li>
700
- <li>It gives strong performance boosts on benchmarks like MMLU and ARC without trying to overfit on them.</li>
701
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
702
  </ul>
703
- <p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the 🍷 FineWeb dataset, with performance just slightly below that of threshold 3.</p>
704
- <p>We release these two datasets as 📚 FineWeb-Edu and 📚 FineWeb-edu-Large along with the classifier used for the filtering.</p>
705
- <p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
706
  <h2>Next steps</h2>
707
  <p>We want to continue improving FineWeb and will also
708
  release a technical report with more details soon.</p>
 
692
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
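<p>As an illustration only (not the project's evaluation code), the sketch below shows how a fixed threshold of 3 turns the 0-5 regression scores into binary labels before computing F1; the annotation and prediction lists are toy stand-ins:</p>
<pre><code class="language-python">
# Illustrative sketch: binarize 0-5 scores at a threshold of 3 and compute F1.
from sklearn.metrics import f1_score

annotations = [0, 1, 3, 4, 5, 2, 3, 1]                  # toy stand-ins for held-out 0-5 labels
predictions = [0.4, 1.2, 3.1, 3.8, 4.6, 2.7, 2.2, 0.8]  # toy regressor outputs

threshold = 3
y_true = [int(a >= threshold) for a in annotations]
y_pred = [int(p >= threshold) for p in predictions]
print(f"F1 at threshold {threshold}: {f1_score(y_true, y_pred):.2f}")
</code></pre>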
693
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
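<p>A minimal inference sketch, assuming the checkpoint loads through <code>AutoModelForSequenceClassification</code> and returns a single regression logit (refer to the linked GitHub code for the exact pipeline):</p>
<pre><code class="language-python">
# Hedged sketch: score one text with the educational-quality regressor.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HuggingFaceTB/snowflake_m_edu_reg"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "Photosynthesis is the process by which plants convert light into chemical energy."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    # Assumption: a single regression output, roughly on the 0-5 educational scale.
    score = model(**inputs).logits.squeeze().item()

print(f"score={score:.2f}", "keep" if score >= 3 else "drop")
</code></pre>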
694
  <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
695
+ <h3>Filtering and results</h3>
696
+ <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of different filtering thresholds and found that a threshold of 3 gave the best results. The plot below shows the performance of each threshold compared to 🍷 FineWeb on six different benchmarks, using a 1.82B model trained on 8B tokens.</p>
697
+ <p><strong>TODO: add the plot</strong></p>
698
+ <p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
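<p>A simplified sketch of this threshold filter, assuming the classifier scores have already been written to a hypothetical <code>score</code> column in local parquet shards (the actual run scored all 15T tokens on GPUs):</p>
<pre><code class="language-python">
# Hedged sketch: keep only documents at or above the chosen threshold.
from datasets import load_dataset

THRESHOLD = 3  # 2 keeps many more tokens; 4 keeps only the strictest subset

# Hypothetical local shards that already carry the classifier's "score" column.
scored = load_dataset("parquet", data_files="scored_fineweb/*.parquet", split="train")
edu = scored.filter(lambda row: row["score"] >= THRESHOLD, num_proc=16)
print(f"kept {len(edu):,} of {len(scored):,} documents")
</code></pre>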
699
+ <p><strong>TODO: add the plot</strong></p>
700
+ <p>Here are the key highlights of the ablation results above:</p>
701
  <ul>
702
  <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
703
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma1.7 to match MMLU results.</li>
 
704
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
705
  </ul>
706
+ <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens. Additionally, for research purposes, we are providing a dataset filtered with a threshold of 4, containing 300 billion tokens.</p>
707
+ <p>You can find the three datasets along with the classifier used for the filtering in this collection: TODO</p>
708
+ <p><strong>TODO: add dataset links and a collection</strong></p>
709
  <h2>Next steps</h2>
710
  <p>We want to continue improving FineWeb and will also
711
  release a technical report with more details soon.</p>