kenhktsui committed on
Commit f785eab · verified · 1 Parent(s): 90e25b5

Create README.md

Files changed (1)
  1. README.md +56 -0
README.md ADDED
@@ -0,0 +1,56 @@
+ ---
+ license: mit
+ datasets:
+ - kenhktsui/llm-data-quality
+ ---
+ # llm-data-textbook-quality-fasttext-classifer-v1
+ This model is built on fastText. It is an optimised version of llm-data-textbook-quality-classifer-v1.
+ It can classify more than 2,000 examples per second on CPU.
+ It classifies whether a text is of textbook quality, so it can be used as a filter for data curation when training an LLM.
+ Please note that textbook quality is a subset of high quality.
+
+
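The throughput figure above depends on hardware and text length, so it can be worth re-measuring locally. The following is a minimal sketch rather than the author's benchmark: `measure_throughput` and its parameters are hypothetical names introduced here, and it times any prediction callable you pass in.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    predict_fn: Callable[[List[str]], object],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed examples-per-second over `repeats` runs.

    `predict_fn` is any callable that classifies a list of strings,
    for example a wrapper around a loaded fastText model (an assumption).
    """
    batch = list(texts)
    if not batch:
        raise ValueError("texts must not be empty")

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(batch)
        best = min(best, time.perf_counter() - start)

    # Guard against a zero reading from a very fast call on a coarse clock.
    return len(batch) / max(best, 1e-9)
```

Passing a function that wraps the loaded model's batch prediction, together with a sample of your own corpus, gives a number comparable to the one quoted above only if the text lengths are similar.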
+ ## Model Performance
+ | Dataset | F1 Score |
+ |---------|----------|
+ | Train | 0.8695 |
+ | Test | 0.8485 |
+
+
+ ## Usage
+ ```python
+ from typing import List
+ import re
+ import fasttext
+
+
+ model = fasttext.load_model("model_textbook_quality.bin")
+
+
+ def replace_newlines(text: str) -> str:
+     # fastText's predict rejects input that contains newlines, so collapse them into spaces.
+     return re.sub("\n+", " ", text)
+
+
+ def predict(text_list: List[str]):
+     text_list = [replace_newlines(text) for text in text_list]
+     pred = model.predict(text_list)
+     # Slice off the "__label__" prefix; str.lstrip strips a set of characters, not a prefix.
+     return [{"label": l[0][len("__label__"):], "score": s[0]} for l, s in zip(*pred)]
+
+
+ predict(["Hi"])
+ # Output: [{'label': 'LOW_QUALITY', 'score': 1.00001}]
+
+ ```
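To use the classifier as a data-curation filter, one option is to keep only texts predicted as textbook quality above a chosen score. The sketch below is an illustration under assumptions: the positive label name `HIGH_QUALITY` and the `0.5` threshold are not stated in this card (only `LOW_QUALITY` appears in the example output), and `keep_high_quality` is a hypothetical helper. It accepts any object with a fastText-style `predict` method.

```python
import re
from typing import Any, Iterable, List

LABEL_PREFIX = "__label__"


def keep_high_quality(
    model: Any,
    texts: Iterable[str],
    min_score: float = 0.5,                # assumed threshold, tune for your data
    positive_label: str = "HIGH_QUALITY",  # assumed label name, check model.labels
) -> List[str]:
    """Return the texts whose top prediction is `positive_label` with score >= min_score."""
    originals = list(texts)
    if not originals:
        return []

    # Collapse newlines before prediction, as in the usage example.
    cleaned = [re.sub(r"\n+", " ", t) for t in originals]
    labels, scores = model.predict(cleaned)

    kept = []
    for text, label, score in zip(originals, labels, scores):
        name = label[0]
        if name.startswith(LABEL_PREFIX):
            name = name[len(LABEL_PREFIX):]
        if name == positive_label and float(score[0]) >= min_score:
            kept.append(text)
    return kept


# Hypothetical usage with the model loaded as in the usage example:
# curated = keep_high_quality(model, my_documents, min_score=0.8)
```

A higher `min_score` trades recall for precision, so it is worth inspecting a sample of kept and rejected texts before filtering a full corpus.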
+
+ ## Benchmark
+
+ | Dataset | Sampling | Average Quality Score |
+ |---------|----------|-----------------------|
+ | [nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) | First 10,000 | 0.8356 |
+ | [nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) | First 10,000 | 0.7488 |
+ | [SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) | First 10,000 | 0.7182 |
+ | [vikp/textbook_quality_programming](https://huggingface.co/datasets/vikp/textbook_quality_programming) | First 10,000 | 0.5410 |
+ | [BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med) | First 10,000 | 0.4760 |
+ | [pszemraj/simple_wikipedia_LM](https://huggingface.co/datasets/pszemraj/simple_wikipedia_LM) | First 10,000 | 0.4670 |
+ | [mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) | First 10,000 | 0.2916 |
+ | [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) | First 10,000 | 0.2525 |
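The card does not define how the average quality score is computed. One plausible reading, offered here purely as an assumption, is the mean predicted probability of the textbook-quality class over the sampled documents. The sketch below implements that reading for a binary classifier; `average_quality_score` and the label `__label__HIGH_QUALITY` are hypothetical, and the result may not reproduce the figures in the table.

```python
import re
from typing import Any, Sequence


def average_quality_score(
    model: Any,
    texts: Sequence[str],
    positive_label: str = "__label__HIGH_QUALITY",  # assumed label name
) -> float:
    """Mean probability of `positive_label`, assuming a two-class model.

    For each text, the top prediction's score is used directly when the top
    label is the positive one, and 1 - score otherwise.
    """
    cleaned = [re.sub(r"\n+", " ", t) for t in texts]
    if not cleaned:
        raise ValueError("texts must not be empty")

    labels, scores = model.predict(cleaned)
    total = 0.0
    for label, score in zip(labels, scores):
        # Clamp, since reported scores can slightly exceed 1 (see the usage output).
        p = min(max(float(score[0]), 0.0), 1.0)
        total += p if label[0] == positive_label else 1.0 - p
    return total / len(cleaned)
```

Applied to the first 10,000 rows of a dataset, this gives one number per dataset that can be compared across corpora in the same way as the table, subject to the assumption above.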