Update README.md
README.md CHANGED
@@ -116,15 +116,15 @@ Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
 |[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 10,000 | 1.017 | Real |
 |[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 10,000 | 0.994 | Real |
 |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 10,000 | 0.853 | Real |
-* I encounted an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) so that I cannot process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
-In general, the synthetic data has higher education value because they are created with a high educational value by design.
-For real data, [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied extensive quality filter described in [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), has the highest educational value across all real data.
+\* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26), so I could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
+The classifier aligns with expectations:
+- In general, the synthetic data has a higher educational value because it is created with a high educational value by design.
+- For real data, [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied the quality filters described [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), has the highest educational value across all real data.
+- The Textbook category scores the highest, reflecting the effectiveness of this model.
+- Wikipedia scores comparatively lower because it is not a textbook, and it also contains information (e.g., the result of a match) with little educational value.
+- Web data scores the lowest.
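The Educational Value formula shown in the hunk context above can be sketched as a small helper. This is a minimal illustration only: the `educational_value` function and the `probs` dictionary are hypothetical names for a classifier's per-class probabilities, not the actual API of this repository.

```python
# Sketch of: Educational Value = 2 * P(High) + 1 * P(Mid) + 0 * P(Low)
# `probs` is a made-up example of classifier output probabilities over
# the three educational-value classes; it is not this repo's real API.

def educational_value(probs: dict) -> float:
    """Expected score over the three classes, on a 0-2 scale."""
    return 2.0 * probs["High"] + 1.0 * probs["Mid"] + 0.0 * probs["Low"]

# Example: a document judged 50% High, 30% Mid, 20% Low
probs = {"High": 0.5, "Mid": 0.3, "Low": 0.2}
print(educational_value(probs))  # 2*0.5 + 1*0.3 + 0*0.2 = 1.3
```

Because the score is a probability-weighted average of 2, 1, and 0, it always falls between 0 and 2, matching the table values above (e.g., 1.017 for fineweb-100k_en-med).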