---

# llm-data-textbook-quality-fasttext-classifer-v2

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/pxajU51PTfF9qpDiugh4n.png)

## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**

This educational value classifier is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data and was then used for data filtering.
The model is built on fastText; it can classify more than 2,000 examples per second on CPU, so it can be used **on the fly** during pretraining.

The model classifies whether a text has high educational value (more explicitly defined than textbook quality). This change of definition is a substantial difference from [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
It can be used as a filter for data curation when training an LLM (a minimal usage sketch follows the label list below).
There are 3 labels instead of 2, offering finer granularity of educational value:
- High (Top 25% educational value)
- Mid (Middle 25-75% educational value)
- Low (Bottom 25% educational value)
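
As a minimal usage sketch (not the card's documented API), the snippet below loads the weights from the Hub with `huggingface_hub` and `fasttext` and queries them. The filename `model.bin` and the exact label strings (`__label__High` / `__label__Mid` / `__label__Low`) are assumptions for illustration; check the repository files and a sample prediction before relying on them.

```python
from huggingface_hub import hf_hub_download
import fasttext

# Download the fastText weights from the Hub. The filename "model.bin" is an
# assumption for illustration — check the repository's file list.
model_path = hf_hub_download(
    repo_id="kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# fastText predicts on a single line of text, so strip newlines first.
text = "Photosynthesis is the process by which plants convert light into chemical energy."
labels, probs = model.predict(text.replace("\n", " "), k=3)
print(list(zip(labels, probs)))  # assumed labels: __label__High / __label__Mid / __label__Low
```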

A detailed report/paper will follow when more downstream experiments with this classifier become available.
The classifier has been applied to various pretraining datasets. See [**Benchmark**](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2#benchmark).

Please note that textbook quality is a subset of high quality.
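
To make the filtering use case concrete, here is a hypothetical sketch that keeps only documents the classifier labels High, reusing `model` from the snippet above. The label string and the 0.5 probability threshold are assumptions for illustration, not a recipe prescribed by this card.

```python
# Hypothetical filter: keep a document only if the top predicted label is
# "High" (label string assumed) with sufficient confidence.
def is_high_educational_value(text: str, threshold: float = 0.5) -> bool:
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__High" and probs[0] >= threshold

docs = [
    "A step-by-step proof of the Pythagorean theorem with worked examples.",
    "lol look at this meme",
]
curated = [d for d in docs if is_high_educational_value(d)]
```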
## Feedback welcomed!
Please give a like and leave a comment if you find this model helpful. I am on a continual journey to make LLM data curation better and easier.

| [allenai/c4 en](https://huggingface.co/datasets/allenai/c4) | First 100,000 | 0.934 | Real |
| [mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) | First 100,000 | 0.857 | Real |
| [tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | First 100,000 | 0.835 | Real |
| [BEE-spoke-data/FineMeme-100k](https://huggingface.co/datasets/BEE-spoke-data/FineMeme-100k) | First 100,000 | 0.716 | Real |
| [neuralcatcher/hateful_memes](https://huggingface.co/datasets/neuralcatcher/hateful_memes) | First 100,000 | 0.070 | Real |

\* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26), so I could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
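
For context, a benchmark-style average like those in the table could be reproduced along the following lines. This is a sketch under stated assumptions: the label-to-weight mapping is illustrative and may not match the exact scoring formula behind the table, and `model` is the fastText model loaded in the earlier sketch.

```python
from datasets import load_dataset

# Assumed mapping from predicted labels to a scalar score in [0, 2].
WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}

def educational_score(text: str) -> float:
    # Probability-weighted average over the three labels.
    labels, probs = model.predict(text.replace("\n", " "), k=3)
    return sum(WEIGHTS[label] * p for label, p in zip(labels, probs))

# Stream the first N examples of a corpus and average the score.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
n = 100_000
scores = [educational_score(ex["text"]) for _, ex in zip(range(n), ds)]
print(f"mean score over first {n:,} examples: {sum(scores) / len(scores):.3f}")
```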

The classifier aligns with expectations:

- In general, synthetic data has higher educational value because it is created with high educational value by design.
- The textbook category (mostly synthetic) scores the highest because such data is created for educational value, reflecting the effectiveness of this model.
- The maths/paper category scores the second highest because of its density of knowledge.
- Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, a movie star's award) of smaller educational value.
- Web data scores low (if no filtering is applied) because it contains all domains.
- Memes score the lowest, as expected; hateful memes score almost zero.