Update README.md
README.md CHANGED
@@ -116,15 +116,15 @@ Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
 |[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 10,000 | 1.017 | Real |
 |[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 10,000 | 0.994 | Real |
 |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 10,000 | 0.853 | Real |
-* I encounted an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) so that I cannot process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
-In general, the synthetic data has higher education value because they are created with a high educational value by design.
-For real data, [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied extensive quality filter described in [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), has the highest educational value across all real data.
+\* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26), so I could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
+The classifier aligns with expectations:
+- In general, the synthetic data has a higher educational value because it is created with a high educational value by design.
+- For real data, [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied the quality filters described [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), has the highest educational value across all real data.
+- The Textbook category scores the highest, reflecting the effectiveness of this model.
+- Wikipedia scores comparatively lower because it is not a textbook, and it also contains information (e.g., the result of a match) with little educational value.
+- Web data scores the lowest.
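The Educational Value formula shown in the hunk context above can be sketched as a small helper. This is a minimal illustration only: the `educational_value` function and the `probs` dictionary are hypothetical names for a classifier's per-class probabilities, not the actual API of this repository.

```python
# Sketch of: Educational Value = 2 * P(High) + 1 * P(Mid) + 0 * P(Low)
# `probs` is a made-up example of classifier output probabilities over
# the three educational-value classes; it is not this repo's real API.

def educational_value(probs: dict) -> float:
    """Expected score over the three classes, on a 0-2 scale."""
    return 2.0 * probs["High"] + 1.0 * probs["Mid"] + 0.0 * probs["Low"]

# Example: a document judged 50% High, 30% Mid, 20% Low
probs = {"High": 0.5, "Mid": 0.3, "Low": 0.2}
print(educational_value(probs))  # 2*0.5 + 1*0.3 + 0*0.2 = 1.3
```

Because the score is a probability-weighted average of 2, 1, and 0, it always falls between 0 and 2, matching the table values above (e.g., 1.017 for fineweb-100k_en-med).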