jeffreywpli
committed on
Commit
•
cd8b714
1
Parent(s):
0ffc363
Update README.md
README.md
CHANGED
@@ -1,6 +1,10 @@
 ---
 license: mit
 ---
-Fasttext model used for filtering in [DataComp-LM](https://arxiv.org/abs/2406.11794)
+Fasttext model used for filtering in [DataComp-LM](https://arxiv.org/abs/2406.11794) to produce [DCLM-Baseline](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0).
 
-
+The model classifies between `__label__hq` and `__label__cc`, which correspond to "high-quality" (i.e., OH2.5 and Reddit ELI5 data) and "low-quality" (i.e., web-crawled data from Common Crawl) respectively. We use the score given to `__label__hq` to filter our documents via a percentile-based threshold.
+
+See our [dclm](https://github.com/mlfoundations/dclm/tree/main/baselines#fasttext-filtering) repo for documentation about how we applied it to filter data in our experiments.
+
+See [fasttext documentation](https://fasttext.cc/docs/en/python-module.html) for general documentation on fasttext classifiers and how to use them with Python.
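
The percentile-based filtering that the README change describes can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the dclm repo: the helper names (`hq_probability`, `score_document`, `percentile_threshold`, `filter_by_percentile`) and the `keep_fraction` parameter are hypothetical, and the only fasttext call assumed is `model.predict(text, k=-1)`, which returns every label with its probability.

```python
from typing import List, Sequence

HQ_LABEL = "__label__hq"  # label named in the README; the other label is __label__cc


def hq_probability(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability assigned to __label__hq, or 0.0 if it is absent."""
    for label, prob in zip(labels, probs):
        if label == HQ_LABEL:
            return float(prob)
    return 0.0


def score_document(model, text: str) -> float:
    """Score one document with any object exposing a fasttext-style predict()."""
    # fasttext predicts one line at a time, so newlines are replaced with spaces.
    labels, probs = model.predict(text.replace("\n", " "), k=-1)
    return hq_probability(labels, probs)


def percentile_threshold(scores: Sequence[float], keep_fraction: float) -> float:
    """Return the lowest score that still falls in the top keep_fraction of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[n_keep - 1]


def filter_by_percentile(model, docs: Sequence[str], keep_fraction: float) -> List[str]:
    """Keep documents whose __label__hq score reaches the percentile threshold."""
    scores = [score_document(model, doc) for doc in docs]
    threshold = percentile_threshold(scores, keep_fraction)
    # Ties at the threshold are all kept, so slightly more than keep_fraction may pass.
    return [doc for doc, score in zip(docs, scores) if score >= threshold]


# Usage sketch (assumes the fasttext package is installed and that "model.bin"
# is a hypothetical local path to the downloaded classifier):
#
#   import fasttext
#   model = fasttext.load_model("model.bin")
#   kept = filter_by_percentile(model, documents, keep_fraction=0.1)
```

The value of `keep_fraction` above is only a placeholder; the thresholds actually used in the experiments are documented in the dclm repo linked in the README.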