metadata

license: apache-2.0
datasets:
  - HuggingFaceFW/fineweb
  - HuggingFaceFW/fineweb-edu
language:
  - en

TensorFlow Model Garden LMs: FineWeb WordPiece Tokenizer

This WordPiece tokenizer was trained as part of the TensorFlow Model Garden LMs project.

The tokenizer was trained on the sample-10BT subsets of the FineWeb and FineWeb-Edu datasets, using a vocabulary size of 64,000 subtokens.
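
For readers new to WordPiece, the following is a minimal sketch of how such a vocabulary is typically applied to a single word: greedy longest-match-first lookup, as popularized by BERT. It is an illustration only, not the code used in this project. The `##` continuation prefix, the `[UNK]` token, the function name `wordpiece_encode`, and the toy vocabulary are all assumptions. The actual settings of this tokenizer are not stated here.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subtokens by greedy longest-match-first lookup.

    Assumes BERT-style conventions: word-internal pieces carry a "##"
    prefix, and a word with no valid split maps to a single unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one has 64,000 entries.
toy_vocab = {"token", "##izer", "##s", "fine", "##web", "[UNK]"}

print(wordpiece_encode("tokenizers", toy_vocab))
print(wordpiece_encode("fineweb", toy_vocab))
print(wordpiece_encode("xyz", toy_vocab))
```

A larger vocabulary, such as the 64,000 subtokens used here, generally means more words are covered by a single piece and fewer are split or mapped to the unknown token.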

A script for training the tokenizer can be found here.