# fasttext-med-en-zh-identification
This model is an intermediate result of the EPCD (Easy-Data-Clean-Pipeline) project. It is designed to accurately distinguish between Chinese and English samples in medical pretraining datasets, and is built on the fastText framework.
## Data Composition
- General Chinese Pretraining Dataset
- Medical Chinese Pretraining Dataset
- General English Pretraining Dataset
- Medical English Pretraining Dataset
The datasets above are all high-quality open-source datasets, which saves considerable effort in data cleaning. Many thanks to their developers for contributing to the open-source data community!
## Data Cleaning Process
1. Initial dataset processing (see the sketch below):
   - For the Chinese training datasets, the pretraining corpus is split by `\n`, and any leading or trailing spaces are removed.
   - For the English training datasets, the pretraining corpus is split by `\n`, all letters are converted to lowercase, and any leading or trailing spaces are removed.
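A minimal sketch of this step, assuming in-memory strings; the `preprocess` function name and the inline sample corpora are illustrative, not part of the released pipeline:

```python
def preprocess(raw_corpus, lowercase):
    """Split a corpus on newlines, strip each sample, optionally lowercase."""
    samples = [line.strip() for line in raw_corpus.split("\n")]
    if lowercase:  # English corpora are additionally lowercased
        samples = [s.lower() for s in samples]
    return [s for s in samples if s]  # drop samples left empty after stripping

en_samples = preprocess("The patient presented with FEVER.\n  Dosage was adjusted.  ", lowercase=True)
zh_samples = preprocess("患者出现发热症状。\n剂量已调整。", lowercase=False)
```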
2. Word count statistics: the number of words in each sample is counted.
3. Sample filtering based on word count (heuristic thresholds; see the sketch below):
   - For Chinese: keep only samples with more than 5 words.
   - For English: keep only samples with more than 5 words.
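The card does not state how words are counted; the sketch below continues from the preprocessing sketch above and assumes whitespace tokens for English and jieba segmentation for Chinese:

```python
import jieba  # assumed tokenizer for Chinese word counts; not confirmed by the source

MIN_WORDS = 5  # heuristic threshold from the step above

def word_count(text, lang):
    # Whitespace tokens for English; jieba segmentation for Chinese
    return len(text.split()) if lang == "en" else len(jieba.lcut(text))

en_kept = [s for s in en_samples if word_count(s, "en") > MIN_WORDS]
zh_kept = [s for s in zh_samples if word_count(s, "zh") > MIN_WORDS]
```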
4. Dataset splitting: 90% of the data is used for training and 10% for testing; one possible materialization is sketched below.
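Continuing the sketch, one way to write the split in fastText's supervised-training format; the `__label__en` / `__label__zh` label names are an assumption, as the card does not list them:

```python
import random
import fasttext

# fastText supervised format: one "__label__<tag> <text>" line per sample
labeled = [f"__label__en {s}" for s in en_kept] + [f"__label__zh {s}" for s in zh_kept]
random.seed(0)
random.shuffle(labeled)

cut = int(0.9 * len(labeled))  # 90% train / 10% test
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(labeled[:cut]))
with open("test.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(labeled[cut:]))

model = fasttext.train_supervised(input="train.txt")
```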
## Model Performance
| Dataset | Precision | Recall |
|---------|-----------|--------|
| Train   | 0.9987    | 0.9987 |
| Test    | 0.9962    | 0.9962 |
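Precision and recall are identical in each row, which is consistent with fastText's built-in evaluation: `model.test` reports precision@1 and recall@1, and with exactly one label per sample the two coincide. Continuing from the training sketch above:

```python
n, p_at_1, r_at_1 = model.test("test.txt")  # (sample count, precision@1, recall@1)
print(f"{n} samples, P@1 = {p_at_1:.4f}, R@1 = {r_at_1:.4f}")
```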
## Usage Example
```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # Apply the same normalization used at training time: strip and lowercase
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="ytzfhqs/fasttext-med-en-zh-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```
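`predict` returns a tuple of predicted labels and their probabilities; passing `k=2`, e.g. `model.predict(to_low('Hello, world!'), k=2)`, returns the scores for both language labels. The exact label strings should be checked against the model's own output, since the card does not list them.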