kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2

Text Classification


kenhktsui committed on May 19, 2024

Commit 6b75c18 · verified · 1 Parent(s): c484dcf

Update README.md

Files changed (1)

README.md +3 -1


@@ -10,8 +10,10 @@ inference: false
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/mC3xwgJJ139R9LXbXpBWj.png)
 This educational value classifier is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data, and was then used for data filtering.
-Model is built on fasttext - it can classify more than 2000 examples per second in CPU, and so I can be used **on-the-fly**.
 This model can classify if a text has high educational value (more explicitly defined then textbook quality).  This definition change is a substantial change vs [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
 It can be used as a filter for data curation when training a LLM.
 There are 3 labels instead of 2 labels, as it offers higher granularity of educational value.

 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/mC3xwgJJ139R9LXbXpBWj.png)
+## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**
 This educational value classifier is deeply inspired by [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644), where a classifier was developed to predict the educational value of data, and was then used for data filtering.
+Model is built on fasttext - it can classify more than 2000 examples per second in CPU, and so it can be used **on-the-fly** during pretraining.
 This model can classify if a text has high educational value (more explicitly defined then textbook quality).  This definition change is a substantial change vs [kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1).
 It can be used as a filter for data curation when training a LLM.
 There are 3 labels instead of 2 labels, as it offers higher granularity of educational value.
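
The README text in this diff describes using the classifier as an on-the-fly filter for data curation. A minimal sketch of that idea follows. It is not the author's pipeline: the label names, the `min_prob` threshold, and the helper `filter_educational` are all assumptions made for illustration, since the diff only states that there are 3 labels. The predictor is passed in as a callable so the sketch does not depend on a particular model file.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

# Assumed label names; the README text only says there are 3 labels.
KEEP_LABELS = frozenset({"__label__High", "__label__Mid"})

# A predictor maps one line of text to (labels, probabilities), best label
# first. This mirrors the shape returned by fastText's model.predict(text, k=1).
Predict = Callable[[str], Tuple[Sequence[str], Sequence[float]]]


def filter_educational(
    texts: Iterable[str],
    predict: Predict,
    keep_labels: frozenset = KEEP_LABELS,
    min_prob: float = 0.5,  # hypothetical threshold, tune on your own data
) -> Iterator[str]:
    """Lazily yield the texts whose top predicted label is in keep_labels."""
    for text in texts:
        # fastText's predict() rejects newlines, so flatten each document
        # to a single line before scoring it.
        labels, probs = predict(text.replace("\n", " "))
        if labels and labels[0] in keep_labels and probs[0] >= min_prob:
            yield text


# Usage with the fasttext package (the model path is hypothetical):
#   import fasttext
#   model = fasttext.load_model("model.bin")
#   kept = filter_educational(corpus, lambda t: model.predict(t, k=1))
```

Because the function is a generator, documents are scored one at a time as they stream through, which is what makes an on-the-fly filter during pretraining practical.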