---

# llm-data-textbook-quality-fasttext-classifer-v2

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/pxajU51PTfF9qpDiugh4n.png)

## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**

This educational value classifier is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data and was then used for data filtering.
The model is built on fastText; it can classify more than 2,000 examples per second on CPU, so it can be used **on the fly** during pretraining.

The model classifies whether a text has high educational value (more explicitly defined than textbook quality). This change of definition is a substantial difference from [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
It can be used as a filter for data curation when training an LLM (a minimal usage sketch follows the label list below).
There are 3 labels instead of 2, offering finer granularity of educational value:
- High (Top 25% educational value)
- Mid (Middle 25-75% educational value)
- Low (Bottom 25% educational value)
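
As a minimal usage sketch (not the card's documented API), the snippet below loads the weights from the Hub with `huggingface_hub` and `fasttext` and queries them. The filename `model.bin` and the exact label strings (`__label__High` / `__label__Mid` / `__label__Low`) are assumptions for illustration; check the repository files and a sample prediction before relying on them.

```python
from huggingface_hub import hf_hub_download
import fasttext

# Download the fastText weights from the Hub. The filename "model.bin" is an
# assumption for illustration — check the repository's file list.
model_path = hf_hub_download(
    repo_id="kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# fastText predicts on a single line of text, so strip newlines first.
text = "Photosynthesis is the process by which plants convert light into chemical energy."
labels, probs = model.predict(text.replace("\n", " "), k=3)
print(list(zip(labels, probs)))  # assumed labels: __label__High / __label__Mid / __label__Low
```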

A detailed report/paper will follow when more downstream experiments with this classifier become available.
The classifier has been applied to various pretraining datasets. See [**Benchmark**](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2#benchmark).

Please note that textbook quality is a subset of high quality.
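
To make the filtering use case concrete, here is a hypothetical sketch that keeps only documents the classifier labels High, reusing `model` from the snippet above. The label string and the 0.5 probability threshold are assumptions for illustration, not a recipe prescribed by this card.

```python
# Hypothetical filter: keep a document only if the top predicted label is
# "High" (label string assumed) with sufficient confidence.
def is_high_educational_value(text: str, threshold: float = 0.5) -> bool:
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__High" and probs[0] >= threshold

docs = [
    "A step-by-step proof of the Pythagorean theorem with worked examples.",
    "lol look at this meme",
]
curated = [d for d in docs if is_high_educational_value(d)]
```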
## Feedback welcomed!
Please give a like and leave a comment if you find this model helpful. I am on a continual journey to make LLM data curation better and easier.

| [allenai/c4 en](https://huggingface.co/datasets/allenai/c4) | First 100,000 | 0.934 | Real |
| [mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) | First 100,000 | 0.857 | Real |
| [tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | First 100,000 | 0.835 | Real |
| [BEE-spoke-data/FineMeme-100k](https://huggingface.co/datasets/BEE-spoke-data/FineMeme-100k) | First 100,000 | 0.716 | Real |
| [neuralcatcher/hateful_memes](https://huggingface.co/datasets/neuralcatcher/hateful_memes) | First 100,000 | 0.070 | Real |

\* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26), so I could not process the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
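
For context, a benchmark-style average like those in the table could be reproduced along the following lines. This is a sketch under stated assumptions: the label-to-weight mapping is illustrative and may not match the exact scoring formula behind the table, and `model` is the fastText model loaded in the earlier sketch.

```python
from datasets import load_dataset

# Assumed mapping from predicted labels to a scalar score in [0, 2].
WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}

def educational_score(text: str) -> float:
    # Probability-weighted average over the three labels.
    labels, probs = model.predict(text.replace("\n", " "), k=3)
    return sum(WEIGHTS[label] * p for label, p in zip(labels, probs))

# Stream the first N examples of a corpus and average the score.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
n = 100_000
scores = [educational_score(ex["text"]) for _, ex in zip(range(n), ds)]
print(f"mean score over first {n:,} examples: {sum(scores) / len(scores):.3f}")
```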

The classifier aligns with expectations:

- In general, synthetic data has higher educational value because it is created with high educational value by design.
- The textbook category (mostly synthetic) scores the highest because such data is created for educational value, reflecting the effectiveness of this model.
- The maths/paper category scores the second highest because of its density of knowledge.
- Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, a movie star's award) of smaller educational value.
- Web data scores low (if no filtering is applied) because it contains all domains.
- Memes score the lowest, as expected; hateful memes score almost zero.