Update README.md
README.md
@@ -8,8 +8,8 @@ inference: false
---
# llm-data-textbook-quality-fasttext-classifer-v2

-![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/pxajU51PTfF9qpDiugh4n.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/IPmnl6Fc4bvUYnpkVZg8N.png)

## **"Garbage in, garbage out. A language model is only as good as its training data, irrespective of its parameter count."**
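For context on how the educational value scores below are produced, here is a minimal usage sketch. It assumes, since this diff does not show the usage section, that the repo ships a standard fastText `model.bin` whose `__label__High`/`__label__Mid`/`__label__Low` outputs are weighted 2/1/0 to give a score between 0 and 2; the repo namespace is a placeholder.

```python
import re

import fasttext
from huggingface_hub import hf_hub_download

# "<user>" is a placeholder: substitute this model's actual namespace.
model_path = hf_hub_download(
    "<user>/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin"
)
model = fasttext.load_model(model_path)

# Assumed label scheme (not shown in this diff): High=2, Mid=1, Low=0.
WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}


def educational_value(texts):
    """Return one 0-2 educational value score per input text."""
    # fastText's predict() does not accept newlines inside input strings.
    texts = [re.sub(r"\n+", " ", t) for t in texts]
    labels, probs = model.predict(texts, k=-1)  # all labels, per text
    return [
        sum(WEIGHTS.get(lab, 0.0) * p for lab, p in zip(labs, ps))
        for labs, ps in zip(labels, probs)
    ]


print(educational_value(["Photosynthesis converts light into chemical energy."]))
```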
@@ -122,16 +122,20 @@ The score can be roughly interpreted as:
|Dataset |Sampling |Average Educational Value |Type|
|---|---|---|---|
|[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 |1.347 |Synthetic|
|[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 |1.260 |Real|
|[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 |1.154 |Synthetic|
+|[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) |First 100,000 |1.115 |Real|
|[open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |First 100,000 |1.089 |Real|
|[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 |1.068 |Real|
|[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) |First 100,000 |1.056 |Real|
|[NousResearch/dolma-v1_7-305B*](https://huggingface.co/datasets/NousResearch/dolma-v1_7-305B) |First 100,000 |1.037 |Real|
+|[tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) |First 100,000 |1.020 |Synthetic|
|[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med) |First 100,000 |1.019 |Real|
|[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) |First 100,000 |0.998 |Real|
|[togethercomputer/RedPajama-Data-V2 en 2023-06](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) |First 100,000 |0.985 |Real|
|[wikipedia en 20220301](https://huggingface.co/datasets/wikipedia) |First 100,000 |0.975 |Real|
+|[Replete-AI/code_bagel](https://huggingface.co/datasets/Replete-AI/code_bagel) |First 100,000 |0.950 |Synthetic|
|[allenai/c4 en](https://huggingface.co/datasets/allenai/c4) |First 100,000 |0.934 |Real|
|[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) |First 100,000 |0.857 |Real|
+|[iamtarun/python_code_instructions_18k_alpaca](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca) |First 100,000 |0.849 |Synthetic|
|[tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) |First 100,000 |0.835 |Real|
|[BEE-spoke-data/FineMeme-100k](https://huggingface.co/datasets/BEE-spoke-data/FineMeme-100k) |First 100,000 |0.716 |Real|
|[neuralcatcher/hateful_memes](https://huggingface.co/datasets/neuralcatcher/hateful_memes) |First 100,000 |0.070 |Real|
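Each value in the table is a mean over the dataset's first 100,000 documents. A sketch of reproducing one row under the same assumptions, reusing `educational_value()` from the sketch above; the dataset id and its `text` column are illustrative:

```python
from datasets import load_dataset

# Stream the first 100,000 documents rather than downloading the dataset.
# "HuggingFaceFW/fineweb" and its "text" column mirror one row of the table.
stream = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

scores, batch = [], []
for i, row in enumerate(stream):
    if i >= 100_000:
        break
    batch.append(row["text"])
    if len(batch) == 1024:  # batch the fastText calls for throughput
        scores.extend(educational_value(batch))
        batch = []
if batch:
    scores.extend(educational_value(batch))

print(f"mean educational value: {sum(scores) / len(scores):.3f}")
```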
@@ -144,6 +148,7 @@ The classifier aligns with the expectation.
- In general, the later a dataset is released, the higher its educational value, because of the increasing focus on data quality in the research community.
- The textbook category (mostly synthetic) scores the highest, because such data is created specifically for educational value, reflecting the effectiveness of this model.
- The maths/paper category scores the second highest because of its density of knowledge.
+- Instruction datasets score lower than textbooks because the depth of knowledge in a conversation is usually lower than in a textbook, but they are in general more educational than unfiltered web data.
- Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, the award won by a movie star) of smaller educational value.
- Web data scores low when no filtering is applied, because it spans all domains (see the filtering sketch after this list).
- Memes score the lowest, as expected; hateful memes score almost zero.
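Because unfiltered web data scores low, one natural application is filtering a web corpus by score. A sketch under the same assumptions, again reusing `educational_value()`; the `1.0` threshold is illustrative, chosen to sit near the curated corpora in the table above:

```python
from datasets import load_dataset

THRESHOLD = 1.0  # illustrative cutoff, near the curated corpora above

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
kept = (row for row in stream if educational_value([row["text"]])[0] >= THRESHOLD)

# Demo: show the first few documents that clear the bar.
for i, doc in enumerate(kept):
    print(doc["text"][:80].replace("\n", " "))
    if i == 4:
        break
```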