Update README.md
README.md
CHANGED
@@ -88,11 +88,18 @@ predict_educational_value(["Hi"])
 # Output: [3.0000010156072676e-05]
 
 ```
-
+# Benchmark
 To make sure this classifier makes sense, it is applied to various datasets.
 
 Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
 
+The score can be interpreted as:
+|Educational Value| Category |
+|--------|----------|
+|2 | High|
+|1 | Mid|
+|0 | Low|
+
 
 |Dataset | Sampling | Average Educational Value | Type |
 |--------------------------------------|---|-------------------|-------|
@@ -115,7 +122,11 @@ Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
 |[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)| First 10,000 | 1.058|Real|
 |[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 10,000 | 1.017|Real|
 |[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 10,000 | 0.994|Real|
+|[togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)| First 10,000 | 0.979|Real|
 |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 10,000 | 0.853|Real|
+|[tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)| First 10,000 | 0.798|Real|
+
+
 \* I encounted an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) so that I cannot process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
 
 
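The scoring formula in the README text of this diff, Educational Value = 2 * P(High) + 1 * P(Mid) + 0 * P(Low), is an expected value over three class probabilities, and the "Average Educational Value" column reads as a mean of that score over the sampled documents. The following is a minimal sketch of that arithmetic only. The function names `educational_value` and `average_educational_value` and the example probabilities are hypothetical and are not taken from the repository; how the classifier itself produces the probabilities is not shown here.

```python
from typing import Iterable, Tuple


def educational_value(p_high: float, p_mid: float, p_low: float) -> float:
    """Expected score: 2 points for High, 1 for Mid, 0 for Low."""
    total = p_high + p_mid + p_low
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"probabilities must sum to 1, got {total}")
    return 2.0 * p_high + 1.0 * p_mid + 0.0 * p_low


def average_educational_value(
    rows: Iterable[Tuple[float, float, float]]
) -> float:
    """Mean score over (p_high, p_mid, p_low) triples, one per document."""
    scores = [educational_value(*row) for row in rows]
    if not scores:
        raise ValueError("at least one document is required")
    return sum(scores) / len(scores)


# Hypothetical probabilities, for illustration only.
score = educational_value(p_high=0.25, p_mid=0.5, p_low=0.25)
average = average_educational_value([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)])
print(score, average)
```

Under this reading, a score near 2 means the probability mass sits on High, a score near 0 means it sits on Low, and a dataset average around 1 can come either from mostly Mid documents or from a mix of High and Low ones.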