---
license: apache-2.0
---
**Model Summary** 
 
This fastText model is used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents. 

Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) dataset page for more details.

- **Developers**: IBM Research
- **Release Date**: Feb 10th, 2025
- **License**: Apache 2.0

**Training Data**

The model is trained on 400k documents, equally split between positive (i.e., high-quality) and negative (i.e., low-quality) classes. Please refer to the [fastText text classification tutorial](https://fasttext.cc/docs/en/python-module.html) for details.
Training data is selected as follows:

- *Positive documents*: 190k synthetic documents randomly sampled from the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset, and
  10k documents with high educational value, selected as follows: first, 600k random documents from [FineWeb-V1.1.0](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
  are annotated by asking Mixtral-8x22B-Instruct to score each document from 1 to 5 for its educational quality (with 5 being the highest quality), using a prompt similar to the one used by [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
  Then, 10k documents are randomly selected from those with scores greater than or equal to 4.
- *Negative documents*: 200k random documents out of the 600k Mixtral-annotated documents with scores less than or equal to 2.
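As a minimal sketch of the setup described above (the label names `hq`/`lq` and the file name `train.txt` are illustrative assumptions, not details of the released training pipeline): fastText's supervised mode expects one document per line, prefixed with a `__label__` tag, so the 200k positive and 200k negative documents would be serialized roughly like this before training:

```python
# Sketch of preparing labeled documents for fastText supervised training.
# Label names ("hq"/"lq") and file paths are illustrative assumptions.

def to_fasttext_line(text: str, label: str) -> str:
    """Format one document as a fastText training line."""
    # fastText treats newlines as record separators, so flatten the text.
    flat = " ".join(text.split())
    return f"__label__{label} {flat}"

# Toy stand-ins for the positive (high-quality) and negative (low-quality) sets.
docs = [
    ("Photosynthesis converts light energy into chemical energy in plants.", "hq"),
    ("click here buy now best deals!!!", "lq"),
]

with open("train.txt", "w", encoding="utf-8") as f:
    for text, label in docs:
        f.write(to_fasttext_line(text, label) + "\n")

# With the `fasttext` package installed, training and scoring a new document
# would then look like:
#   import fasttext
#   model = fasttext.train_supervised(input="train.txt")
#   labels, probs = model.predict("some new document text")
```

In an ensemble filter such as GneissWeb's, the predicted label (or the probability of the low-quality class) would then feed into the document-level keep/drop decision alongside the other quality annotators.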