Commit 5dccdbc ("Update README.md") by kenhktsui — 1 parent: 007f646

Files changed (1): README.md (+7 -7)
README.md CHANGED
@@ -116,15 +116,15 @@ Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
  |[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 10,000 | 1.017|Real|
  |[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 10,000 | 0.994|Real|
  |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 10,000 | 0.853|Real|
- * I encounted an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) so that I cannot process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
+ \* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26), so I could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/fCWpxWB1yLmJwWhPXsjIw.png)
-
- The classifier aligns with the expectation. Textbook category scores the highest, reflecting the effectiveness of this model.
- Wikipedia scores comparatively lower because it is not textbook after all and it also contains information (result of a match) that has small educational value.
- Web scores the lowest.
+ The classifier aligns with expectations:

- In general, the synthetic data has higher education value because they are created with a high educational value by design.
- For real data, [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied extensive quality filter described in [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), has the highest educational value across all real data.
+ - In general, synthetic data has higher educational value because it is created with high educational value by design.
+   For real data, [Dolma v1_7](https://huggingface.co/datasets/allenai/dolma), which applied the quality filters described [here](https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), has the highest educational value across all real data.
+ - The textbook category scores the highest, reflecting the effectiveness of this model.
+ - Wikipedia scores comparatively lower because it is not a textbook after all, and it also contains information of little educational value (e.g. the result of a match).
+ - The web category scores the lowest.
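The hunk header above carries the scoring rule used throughout the table: Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low). A minimal sketch of applying that rule to per-label probabilities — the `__label__*` names, the `educational_value` helper, and the example probabilities are assumptions for illustration, not taken from the actual model:

```python
# Scoring rule from the model card:
#   Educational Value = 2 * P(High) + 1 * P(Mid) + 0 * P(Low)
# The fastText-style label names below are an assumption.
WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}

def educational_value(probs: dict) -> float:
    """Expected educational value (range 0..2) from per-label probabilities."""
    return sum(WEIGHTS.get(label, 0.0) * p for label, p in probs.items())

# Hypothetical probabilities for one document (not real model output):
probs = {"__label__High": 0.6, "__label__Mid": 0.3, "__label__Low": 0.1}
print(educational_value(probs))  # ≈ 1.5, i.e. 2*0.6 + 1*0.3 + 0*0.1
```

In practice the probabilities would come from the fastText model's prediction over all three labels; a score near 2 indicates textbook-like text and a score near 0 indicates low-value web text, matching the dataset averages in the table.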