---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-edu
language:
- en
---

# TensorFlow Model Garden LMs: FineWeb WordPiece Tokenizer

This WordPiece tokenizer was trained as part of the TensorFlow Model Garden LMs project.

The tokenizer was trained on the `sample-10BT` packages of the FineWeb and FineWeb-Edu datasets, using a vocabulary size of 64,000 subtokens.

A script for training the tokenizer can be found [here](https://github.com/stefan-it/model-garden-lms/blob/main/bert/train_vocab.py).
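
For quick experimentation, the tokenizer can be loaded with the Hugging Face `transformers` library. The sketch below assumes the tokenizer files are hosted in this repository; the repository id shown is a placeholder, so substitute the actual model id when using it:

```python
from transformers import AutoTokenizer

# Placeholder repository id -- replace with the actual Hub id of this tokenizer.
tokenizer = AutoTokenizer.from_pretrained("model-garden-lms/fineweb-wordpiece-tokenizer")

text = "TensorFlow Model Garden LMs use a 64k WordPiece vocabulary."
encoding = tokenizer(text)

# Inspect the WordPiece subtokens produced for the input text.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```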