Text Classification
fastText
English
kenhktsui committed on
Commit 8e4f87c
1 Parent(s): 6b75c18

Update README.md

Files changed (1)
  1. README.md +10 -7
README.md CHANGED
@@ -8,23 +8,24 @@ inference: false
 ---
 # llm-data-textbook-quality-fasttext-classifer-v2
 
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/mC3xwgJJ139R9LXbXpBWj.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/pxajU51PTfF9qpDiugh4n.png)
+
 
 ## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**
 
 This educational value classifier is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data and was then used for data filtering.
 The model is built on fastText - it can classify more than 2,000 examples per second on CPU, so it can be used **on-the-fly** during pretraining.
-This model can classify whether a text has high educational value (more explicitly defined than textbook quality). This definition change is substantial vs [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
+The model can classify whether a text has high educational value (more explicitly defined than textbook quality). The definition change is substantial vs [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
 It can be used as a filter for data curation when training an LLM.
 There are 3 labels instead of 2, as this offers higher granularity of educational value:
 - High (top 25% educational value)
 - Mid (middle 25-75% educational value)
 - Low (bottom 25% educational value)
 
-Please note that textbook quality is a subset of high quality.
-A detailed report/paper will follow when more downstream experiments with this classifier become available.
-The classifier has been applied to various pretraining datasets; see [**Benchmark**](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2#benchmark).
+A detailed report/paper will follow when more downstream experiments with this classifier become available.
+The classifier has been applied to various pretraining datasets; see [**Benchmark**](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2#benchmark).
 
+Please note that textbook quality is a subset of high quality.
 
 ## Feedback welcomed!
 Please give a like and leave a comment if you find this model helpful. I am on a continual journey to make LLM data curation better and easier.
@@ -132,9 +133,10 @@ The score can be roughly interpreted as:
 |[allenai/c4 en](https://huggingface.co/datasets/allenai/c4)| First 100,000 | 0.934 |Real|
 |[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 100,000 | 0.857 |Real|
 |[tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)| First 100,000 | 0.835 |Real|
+|[BEE-spoke-data/FineMeme-100k](https://huggingface.co/datasets/BEE-spoke-data/FineMeme-100k)| First 100,000 | 0.716 |Real|
+|[neuralcatcher/hateful_memes](https://huggingface.co/datasets/neuralcatcher/hateful_memes)| First 100,000 | 0.070 |Real|
 \* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26), so I could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
 
-
 The classifier aligns with expectations.
 
 - In general, synthetic data has higher educational value because it is created with high educational value by design.
@@ -143,4 +145,5 @@ The classifier aligns with expectations.
 - The textbook category (mostly synthetic) scores the highest because these texts are created for educational value, reflecting the effectiveness of this model.
 - The maths/paper category scores the second highest because of its density of knowledge.
 - Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, the award won by a movie star) of less educational value.
-- Web scores the lowest (if no filtering is applied) because it contains all domains.
+- Web scores low (if no filtering is applied) because it contains all domains.
+- Memes score the lowest, as expected; hateful memes score almost zero.
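
For readers who want to try the classifier described in this card, here is a minimal sketch of loading it from the Hub and predicting the three labels. The repo id comes from the card itself; the checkpoint filename `model.bin` and the exact `__label__` strings are assumptions, so the snippet prints whatever labels the model actually emits rather than hard-coding them.

```python
# Minimal usage sketch - not an official snippet from this commit.
# Assumptions: the checkpoint is stored as "model.bin" in the repo, and the
# model emits fastText-style "__label__..." labels for High/Mid/Low.
from huggingface_hub import hf_hub_download
import fasttext

model_path = hf_hub_download(
    repo_id="kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2",
    filename="model.bin",  # assumed filename
)
model = fasttext.load_model(model_path)

text = "Photosynthesis converts light energy into chemical energy stored in glucose."
# fastText rejects newline characters in input; k=-1 returns every label
# with its probability instead of only the top one.
labels, probs = model.predict(text.replace("\n", " "), k=-1)
print(list(zip(labels, probs)))
```

Because each prediction is a single cheap CPU call, a loop like this can sit inside a streaming pretraining pipeline, which is what the "on-the-fly" claim in the card refers to.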
 
 
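The benchmark table reports one scalar score per dataset. Below is a hedged sketch of how such a score could be computed, continuing from the sketch above (`model` is the loaded fastText model). The High=2/Mid=1/Low=0 weighting matches the table's 0-2 range but is an assumption not confirmed by this commit, as are the label strings and the 0.7 filter threshold.

```python
# Hedged sketch: collapse the 3 label probabilities into one 0-2 score, then
# use it benchmark-style (mean over a sample) and curation-style (filtering).
import itertools
from datasets import load_dataset

WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}  # assumed

def educational_value(text: str) -> float:
    # "model" is the fastText model loaded in the previous sketch.
    labels, probs = model.predict(text.replace("\n", " "), k=-1)
    return sum(WEIGHTS.get(label, 0.0) * p for label, p in zip(labels, probs))

# Benchmark-style: mean score over the first 100,000 streamed examples.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
sample = [ex["text"] for ex in itertools.islice(ds, 100_000)]
print(sum(map(educational_value, sample)) / len(sample))  # card reports 0.934 for c4 en

# Curation-style: keep only documents above an illustrative threshold.
kept = [doc for doc in sample if educational_value(doc) > 0.7]
```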