loubnabnl committed on
Commit
5385888
·
1 Parent(s): 7c87bf2
Files changed (1)
  1. index.html +9 -6
index.html CHANGED
@@ -692,17 +692,20 @@
692
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
693
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
694
  <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
695
- <h3>Filtering</h3>
696
- <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. To build 📚 FineWeb-Edu, we filtered out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. Here are the key highlights of the ablation results:</p>
697
  <ul>
698
  <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
699
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma1.7 to match MMLU results.</li>
700
- <li>It gives strong performance boosts on benchmarks like MMLU and ARC without trying to overfit on them.</li>
701
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
702
  </ul>
703
- <p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the 🍷 FineWeb dataset, with performance just slightly below that of threshold 3.</p>
704
- <p>We release these two datasets as 📚 FineWeb-Edu and 📚 FineWeb-edu-Large along with the classifier used for the filtering.</p>
705
- <p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
706
  <h2>Next steps</h2>
707
  <p>We want to continue improving FineWeb and will also
708
  release a technical report with more details soon.</p>
 
692
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
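<p>As an illustration only (not the project's evaluation code), the sketch below shows how a fixed threshold of 3 turns the 0-5 regression scores into binary labels before computing F1; the annotation and prediction lists are toy stand-ins:</p>
<pre><code class="language-python">
# Illustrative sketch: binarize 0-5 scores at a threshold of 3 and compute F1.
from sklearn.metrics import f1_score

annotations = [0, 1, 3, 4, 5, 2, 3, 1]                  # toy stand-ins for held-out 0-5 labels
predictions = [0.4, 1.2, 3.1, 3.8, 4.6, 2.7, 2.2, 0.8]  # toy regressor outputs

threshold = 3
y_true = [int(a >= threshold) for a in annotations]
y_pred = [int(p >= threshold) for p in predictions]
print(f"F1 at threshold {threshold}: {f1_score(y_true, y_pred):.2f}")
</code></pre>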
693
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
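<p>A minimal inference sketch, assuming the checkpoint loads through <code>AutoModelForSequenceClassification</code> and returns a single regression logit (refer to the linked GitHub code for the exact pipeline):</p>
<pre><code class="language-python">
# Hedged sketch: score one text with the educational-quality regressor.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HuggingFaceTB/snowflake_m_edu_reg"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "Photosynthesis is the process by which plants convert light into chemical energy."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    # Assumption: a single regression output, roughly on the 0-5 educational scale.
    score = model(**inputs).logits.squeeze().item()

print(f"score={score:.2f}", "keep" if score >= 3 else "drop")
</code></pre>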
694
  <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
695
+ <h3>Filtering and results</h3>
696
+ <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of different filtering thresholds and found that a threshold of 3 gave the best results. The plot below shows the performance of each threshold compared to 🍷 FineWeb on six different benchmarks, using a 1.82B model trained on 8B tokens.</p>
697
+ <p><strong>TODO: add the plot</strong></p>
698
+ <p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
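<p>A simplified sketch of this threshold filter, assuming the classifier scores have already been written to a hypothetical <code>score</code> column in local parquet shards (the actual run scored all 15T tokens on GPUs):</p>
<pre><code class="language-python">
# Hedged sketch: keep only documents at or above the chosen threshold.
from datasets import load_dataset

THRESHOLD = 3  # 2 keeps many more tokens; 4 keeps only the strictest subset

# Hypothetical local shards that already carry the classifier's "score" column.
scored = load_dataset("parquet", data_files="scored_fineweb/*.parquet", split="train")
edu = scored.filter(lambda row: row["score"] >= THRESHOLD, num_proc=16)
print(f"kept {len(edu):,} of {len(scored):,} documents")
</code></pre>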
699
+ <p><strong>TODO: add the plot</strong></p>
700
+ <p>Here are the key highlights of the ablation results above:</p>
701
  <ul>
702
  <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
703
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma1.7 to match MMLU results.</li>
 
704
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
705
  </ul>
706
+ <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens. Additionally, for research purposes, we are providing a dataset filtered with a threshold of 4, containing 300 billion tokens.</p>
707
+ <p>You can find the three datasets along with the classifier used for the filtering in this collection: TODO</p>
708
+ <p><strong>TODO: add dataset links and a collection</strong></p>
709
  <h2>Next steps</h2>
710
  <p>We want to continue improving FineWeb and will also
711
  release a technical report with more details soon.</p>