---
pipeline_tag: text-classification
library_name: fasttext
---

<p align="center">
    📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
<br>
</p>


## Model Summary
This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches](https://arxiv.org/abs/2503.00808). It is also the classifier used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
The positive and negative label names are `__label__1` and `__label__0`, respectively.
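
As a quick way to see how the two labels show up in practice, the following is a minimal sketch that scores a single document with the `fasttext` Python package and returns the probability of the positive label. The helper names (`positive_score`, `score_text`, `load_and_score`) are hypothetical and not part of the released code; the model file name `PreSelect-classifier.bin` is assumed to be a local copy of this classifier.

```python
from typing import Sequence

POSITIVE_LABEL = "__label__1"  # positive label name stated in this model card


def positive_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability of the positive label, or 0.0 if it is absent."""
    for label, prob in zip(labels, probs):
        if label == POSITIVE_LABEL:
            return float(prob)
    return 0.0


def score_text(model, text: str) -> float:
    """Score one document with an already loaded fastText model."""
    # fastText predicts one line at a time, so newlines are replaced first.
    single_line = text.replace("\n", " ")
    # k=2 asks for both labels so the positive one is always in the result.
    labels, probs = model.predict(single_line, k=2)
    return positive_score(labels, probs)


def load_and_score(model_path: str, text: str) -> float:
    """Hypothetical convenience wrapper: load the model, then score `text`."""
    import fasttext  # imported lazily so the helpers above need no dependency

    model = fasttext.load_model(model_path)
    return score_text(model, text)
```

For example, `load_and_score("PreSelect-classifier.bin", "some document text")` would return a value between 0 and 1. When scoring many documents, load the model once and call `score_text` repeatedly instead.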

## How to use
To run the filtering with any fastText model, refer to the paper's code repo, or simply use the following script:

```python
import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, required=True, help="input path name")
parser.add_argument("--output_path", type=str, required=True, help="output name")
args = parser.parse_args()

# Create the output directory if it does not exist yet.
Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read JSONL documents; the text is expected under the "text" key.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents whose positive-label ("1") score passes the 0.5 threshold.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the surviving documents as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
```
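
The pipeline above filters with a fixed probability of 0.5, whereas the Model Summary mentions a selection threshold of 10%. The sketch below shows one way to turn a fraction into a score cutoff, assuming the 10% figure means keeping the top 10% of documents ranked by positive-label score, and assuming the scores have already been computed. The function names are hypothetical and this is not necessarily how the released dataset was produced.

```python
import math
from typing import List, Sequence


def top_fraction_cutoff(scores: Sequence[float], fraction: float = 0.10) -> float:
    """Return the lowest score that still falls in the top `fraction` of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Number of documents to keep, always at least one.
    keep = max(1, math.ceil(len(scores) * fraction))
    return sorted(scores, reverse=True)[keep - 1]


def select_top_fraction(
    docs: Sequence[str], scores: Sequence[float], fraction: float = 0.10
) -> List[str]:
    """Keep the documents whose score reaches the top-fraction cutoff."""
    cutoff = top_fraction_cutoff(scores, fraction)
    # Input order is preserved; ties at the cutoff are all kept.
    return [doc for doc, score in zip(docs, scores) if score >= cutoff]
```

Because ties at the cutoff are kept, slightly more than the requested fraction can survive when many documents share the same score. On a large corpus, the cutoff would more practically be estimated from a sample of scores and then passed as the minimum score to the filter.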

## Training
For more training details, refer to the paper. The training code is available on GitHub at [PreSelect](https://github.com/hkust-nlp/preselect).

## Citation
If you find this work helpful, please cite it as:
```
@article{shum2025predictivedataselectiondata,
      title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches}, 
      author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
      journal={arXiv preprint arXiv:2503.00808},
      year={2025},
      eprint={2503.00808},
}
```