---
pipeline_tag: text-classification
library_name: fasttext
---

<p align="center">
📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
<br>
</p>


## Model Summary
This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches](https://arxiv.org/abs/2503.00808). It is also the classifier we used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
The positive and negative label names are `__label__1` and `__label__0`, respectively.
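
As a minimal sketch of how these labels might be read directly with the `fasttext` Python package: the helper names below are hypothetical, the local model path is an assumption, and `top_fraction` assumes that the 10% threshold means keeping the highest-scoring 10% of documents, which this card does not spell out.

```python
from typing import List, Sequence

POSITIVE_LABEL = "__label__1"  # positive label name stated in this model card


def positive_probability(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability assigned to the positive label, or 0.0 if absent."""
    for label, prob in zip(labels, probs):
        if label == POSITIVE_LABEL:
            return float(prob)
    return 0.0


def score_text(model, text: str) -> float:
    """Score one document with an already loaded fastText model.

    fastText's predict() rejects newlines, so they are replaced with spaces.
    k=2 requests both labels so the positive one is always returned.
    """
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return positive_probability(labels, probs)


def top_fraction(scores: Sequence[float], fraction: float = 0.10) -> List[int]:
    """Indices of the highest-scoring `fraction` of documents.

    Assumption: "threshold of 10%" is read as "keep the top 10% by score".
    """
    if not scores:
        return []
    keep = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])


def load_and_score(model_path: str, text: str) -> float:
    """Load the classifier from `model_path` (assumed local file) and score `text`."""
    import fasttext  # imported lazily; install with `pip install fasttext`

    return score_text(fasttext.load_model(model_path), text)
```

For example, `load_and_score("PreSelect-classifier.bin", "Some document text.")` would return the positive-label probability for one document, assuming the model file has been downloaded to the working directory.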

## How to use
To run the filtering with any fastText model, refer to the code repo of the paper. Alternatively, use the following script:

```python
import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, required=True, help="input path name")
parser.add_argument("--output_path", type=str, required=True, help="output name")
args = parser.parse_args()

# Create the output directory if it does not exist yet.
Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read JSONL documents; the document body is taken from the "text" field.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents whose score for label "1" reaches the 0.5 threshold.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the kept documents as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
```
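
The reader in the script above expects JSONL input whose records carry the document body in a `text` field. The following is a minimal sketch of preparing such a file for a quick local test. The helper name `write_jsonl` and the paths are hypothetical and not part of the PreSelect code.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl(path: str, texts: Iterable[str]) -> int:
    """Write one JSON object per line with a "text" field; return the record count."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out.open("w", encoding="utf-8") as f:
        for text in texts:
            # One record per line, matching text_key="text" in the reader.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

After calling, say, `write_jsonl("data/in/sample.jsonl", ["first document", "second document"])`, the directory `data/in` could be passed as `--input_path`. For a small test like this, a lower `tasks` value than 100 is likely sufficient.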

## Training
For training details, refer to the paper. The training code is available on GitHub at [PreSelect](https://github.com/hkust-nlp/preselect).

## Citation
If you find this work helpful, please cite it as:
```
@article{shum2025predictivedataselectiondata,
  title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
  author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
  journal={arXiv preprint arXiv:2503.00808},
  year={2025},
  eprint={2503.00808},
}
```