Create README.md
README.md
ADDED
@@ -0,0 +1,56 @@
---
license: mit
datasets:
- kenhktsui/llm-data-quality
---
# llm-data-textbook-quality-fasttext-classifer-v1
This model is built on fastText. It is an optimised version of llm-data-textbook-quality-classifer-v1.
It can classify more than 2,000 examples per second on CPU.
It classifies whether a text is of textbook quality, so it can be used as a filter for data curation when training an LLM.
Note that textbook quality is a subset of high quality.
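
The throughput figure above will vary with hardware and text length. A rough way to check it locally is sketched below. The helper name `measure_throughput` and the `predict_fn` parameter are hypothetical and not part of this model card; any callable that takes a list of strings can be passed in.

```python
import time
from typing import Callable, List


def measure_throughput(predict_fn: Callable[[List[str]], list], texts: List[str]) -> float:
    """Return the number of examples classified per second for one batch call."""
    start = time.perf_counter()
    predict_fn(texts)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very fast calls.
    return len(texts) / max(elapsed, 1e-9)
```

Timing a single batch is noisy, so averaging over several runs on representative texts gives a more reliable estimate.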

## Model Performance
| Dataset | F1 Score |
|---------|----------|
| Train | 0.8695 |
| Test | 0.8485 |

## Usage
```python
from typing import List
import re
import fasttext


# Load the trained fastText model from a local file.
model = fasttext.load_model("model_textbook_quality.bin")


def replace_newlines(text: str) -> str:
    # fastText predicts one line at a time, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]):
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list)
    # Remove the "__label__" prefix that fastText adds to each label.
    return [{"label": l[0].replace("__label__", "", 1), "score": s[0]} for l, s in zip(*pred)]


predict(["Hi"])
# Output: [{'label': 'LOW_QUALITY', 'score': 1.00001}]
```
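
As a sketch of the data-curation use case, the helper below drops texts that a predictor labels `LOW_QUALITY`. The function name `filter_low_quality`, the `predict_fn` parameter, and the default `threshold` of 0.5 are assumptions made for illustration. The only label taken from this card is `LOW_QUALITY`, and the predictor is expected to return dictionaries shaped like the output of `predict` above.

```python
from typing import Callable, Dict, List


def filter_low_quality(
    texts: List[str],
    predict_fn: Callable[[List[str]], List[Dict]],
    threshold: float = 0.5,  # assumed cut-off; tune it on your own data
) -> List[str]:
    """Keep every text unless it is labelled LOW_QUALITY with score >= threshold."""
    predictions = predict_fn(texts)
    kept = []
    for text, pred in zip(texts, predictions):
        is_low = pred["label"] == "LOW_QUALITY" and pred["score"] >= threshold
        if not is_low:
            kept.append(text)
    return kept
```

With the `predict` function defined above, this could be called as `filter_low_quality(corpus, predict)`, where `corpus` is a list of strings.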

## Benchmark

| Dataset | Sampling | Average Quality Score |
|---------|----------|-----------------------|
| [nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) | First 10,000 | 0.8356 |
| [nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) | First 10,000 | 0.7488 |
| [SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) | First 10,000 | 0.7182 |
| [vikp/textbook_quality_programming](https://huggingface.co/datasets/vikp/textbook_quality_programming) | First 10,000 | 0.5410 |
| [BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med) | First 10,000 | 0.4760 |
| [pszemraj/simple_wikipedia_LM](https://huggingface.co/datasets/pszemraj/simple_wikipedia_LM) | First 10,000 | 0.4670 |
| [mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) | First 10,000 | 0.2916 |
| [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) | First 10,000 | 0.2525 |