---
license: mit
datasets:
- kenhktsui/llm-data-quality
language:
- en
library_name: fasttext
pipeline_tag: text-classification
---
# llm-data-textbook-quality-fasttext-classifer-v1
This model is built on fastText. It is an optimised version of [llm-data-textbook-quality-classifer-v1](https://huggingface.co/kenhktsui/llm-data-textbook-quality-classifer-v1).  
Not only does it achieve a higher F1 score, but it can also classify more than 2,000 examples per second on CPU.  
The model classifies whether a text is of textbook quality, so it can be used as a filter for data curation when training an LLM.  
Please note that textbook quality is a subset of high quality.  


## Model Performance
|Dataset | F1 Score |
|-------|-------|
|Train | 0.8695|
|Test | 0.8485|


## Usage
```python
from typing import List
import re
import fasttext


model = fasttext.load_model("model_textbook_quality.bin")


def replace_newlines(text: str) -> str:
    # fastText's predict processes one line at a time, so collapse newlines into spaces
    return re.sub(r"\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list)
    # Strip the "__label__" prefix fastText prepends to every label
    return [{"label": labels[0].replace("__label__", ""), "score": scores[0]}
            for labels, scores in zip(*pred)]


predict(["Hi"])
# Output: [{'label': 'LOW_QUALITY', 'score': 1.00001}]
# (fastText probabilities can slightly exceed 1.0 due to floating-point rounding)
```
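
Building on the `predict` helper above, here is a minimal sketch of how the classifier might serve as a data-curation filter. The `keep_high_quality` helper and the 0.5 threshold are illustrative choices, not part of this model.

```python
def keep_high_quality(docs: List[str], threshold: float = 0.5) -> List[str]:
    # Illustrative helper: retain only documents whose top label is
    # HIGH_QUALITY with probability above the (assumed) threshold.
    results = predict(docs)
    return [doc for doc, res in zip(docs, results)
            if res["label"] == "HIGH_QUALITY" and res["score"] >= threshold]


corpus = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "click here 4 free stuff!!!",
]
filtered = keep_high_quality(corpus)
```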

## Benchmark

|Dataset | Sampling | Average Quality Score |
|--------------------------------------|---|-------------------|
|[nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) |Full | 0.8350|
|[nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) |Full | 0.7535|
|[SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) |Full | 0.7202|
|[vikp/textbook_quality_programming](https://huggingface.co/datasets/vikp/textbook_quality_programming) |Full| 0.5447|
|[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| Full | 0.4754|
|[pszemraj/simple_wikipedia_LM](https://huggingface.co/datasets/pszemraj/simple_wikipedia_LM) | Full | 0.4704|
|[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| Full | 0.2963|
|[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| Full | 0.2562|


Average Quality Score is defined as the average predicted probability of the HIGH_QUALITY label over the dataset.
The classifier aligns with expectations: the textbook datasets score the highest, reflecting the effectiveness of this model; Wikipedia scores lower because it is not written as a textbook; and web data scores the lowest.
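
For reference, a minimal sketch of how an Average Quality Score could be computed with this model. The helper name and the use of `k=2` to retrieve both labels' probabilities are assumptions, not the original benchmark code.

```python
def average_quality_score(docs: List[str]) -> float:
    # Request both labels (k=2) so the HIGH_QUALITY probability is
    # available even when LOW_QUALITY is the top prediction.
    docs = [replace_newlines(doc) for doc in docs]
    all_labels, all_scores = model.predict(docs, k=2)
    total = 0.0
    for labels, scores in zip(all_labels, all_scores):
        for label, score in zip(labels, scores):
            if label == "__label__HIGH_QUALITY":
                total += score
    return total / len(docs)
```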