update
index.html CHANGED (+9 -6)
@@ -692,17 +692,20 @@
 <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
 <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
 <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
-<h3>Filtering</h3>
-<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours.
+<h3>Filtering and results</h3>
+<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that a threshold of 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
+<p><strong>TODO: add the plot</strong></p>
+<p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
+<p><strong>TODO: add the plot</strong></p>
+<p>Here are the key highlights of the ablation results above:</p>
 <ul>
 <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
 <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma 1.7 to match MMLU results.</li>
-<li>It gives strong performance boosts on benchmarks like MMLU and ARC without trying to overfit on them.</li>
 <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
 </ul>
-<p>
-<p>
-<p><strong>TODO: add
+<p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens. Additionally, for research purposes, we are providing the dataset filtered with a threshold of 4, containing 300 billion tokens.</p>
+<p>You can find the three datasets along with the classifier used for the filtering in this collection: TODO</p>
+<p><strong>TODO: add dataset links and a collection</strong></p>
 <h2>Next steps</h2>
 <p>We want to continue improving FineWeb and will also
 release a technical report with more details soon.</p>
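As a minimal sketch of the binarization step described in the diff above: the regressor's 0-5 educational scores are cut at a fixed threshold of 3 and F1 is computed on the held-out validation set. The variable names and toy values below are placeholders, not taken from the released training code.

```python
# Hypothetical sketch: binarize 0-5 educational scores at a threshold of 3
# and compute F1, as described above. `val_labels` and `val_scores` are
# placeholder names with toy values, not the actual validation data.
from sklearn.metrics import f1_score

THRESHOLD = 3

val_labels = [0, 2, 3, 5, 4, 1]              # LLM-annotated scores (ground truth)
val_scores = [0.4, 2.6, 3.2, 4.7, 2.9, 1.1]  # regressor predictions

y_true = [int(y >= THRESHOLD) for y in val_labels]
y_pred = [int(p >= THRESHOLD) for p in val_scores]

print(f"F1 at threshold {THRESHOLD}: {f1_score(y_true, y_pred):.2f}")
```

Keeping the raw 0-5 regression output and only cutting it at filtering time is what makes it cheap to release variants at other thresholds (2 and 4) later, as the added paragraphs describe.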
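The filtering itself can be pictured with a short usage sketch. The model card for HuggingFaceTB/snowflake_m_edu_reg is still a TODO above, so loading the checkpoint as a single-output sequence-classification (regression) head, and the preprocessing shown here, are assumptions rather than the documented interface.

```python
# Hypothetical usage sketch: score documents with the released regressor and
# keep only those at or above the threshold of 3 used to build FineWeb-Edu.
# Assumption: the checkpoint loads as a single-output regression head; the
# model card is still a TODO, so the real preprocessing may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "HuggingFaceTB/snowflake_m_edu_reg"
THRESHOLD = 3.0

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def edu_scores(texts: list[str]) -> list[float]:
    """Predict an educational score (roughly 0 to 5) for each text."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze(-1).tolist()

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "BUY NOW!!! Limited offer on designer sunglasses, click here.",
]
for doc, score in zip(docs, edu_scores(docs)):
    decision = "keep" if score >= THRESHOLD else "drop"
    print(f"{score:.2f} [{decision}] {doc}")
```

At the scale described in the diff (the 15T tokens of FineWeb), the same scoring runs as a batched GPU job over the whole dataset, which is where the quoted 6,000 H100 GPU hours come from; raising the cutoff to 4 or lowering it to 2 yields the smaller and larger released variants.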