Tags: Transformers · English · electra · pretraining · Inference Endpoints

TensorFlow Model Garden LMs: FineWeb WordPiece Tokenizer

This WordPiece tokenizer was trained as part of the TensorFlow Model Garden LMs project.

The tokenizer was trained on the sample-10BT subsets of the FineWeb and FineWeb-Edu datasets, with a vocabulary size of 64,000 subtokens.
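For reference, here is a minimal sketch of loading the tokenizer with the Transformers library. It assumes the repo's tokenizer files load via AutoTokenizer (the Transformers/electra tags suggest this, but treat it as an assumption rather than a documented API for this repo):

```python
from transformers import AutoTokenizer

# Assumption: the repo exposes standard tokenizer files that
# AutoTokenizer can resolve from the Hub.
tokenizer = AutoTokenizer.from_pretrained("model-garden-lms/fineweb-lms-vocab-64000")

print(tokenizer.vocab_size)  # expected: 64000
print(tokenizer.tokenize("FineWeb is a large-scale web corpus."))
```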

The script used to train this tokenizer can be found here.
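The linked script is the authoritative reference. As an illustration only, the sketch below approximates the setup using the Hugging Face `tokenizers` and `datasets` libraries. The dataset ids (HuggingFaceFW/fineweb, HuggingFaceFW/fineweb-edu), the lowercasing choice, and the use of BertWordPieceTokenizer are assumptions; the actual training pipeline may use different tooling.

```python
from datasets import load_dataset
from tokenizers import BertWordPieceTokenizer

# Stream the sample-10BT subsets so neither corpus has to fit on disk.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True
)
fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
)

def texts():
    # Yield raw document text from both corpora, one example at a time.
    for dataset in (fineweb, fineweb_edu):
        for example in dataset:
            yield example["text"]

# Assumption: lowercased, BERT-style WordPiece; the real script may differ.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train_from_iterator(texts(), vocab_size=64000)
tokenizer.save_model(".")  # writes vocab.txt for the 64k WordPiece vocabulary
```

Streaming keeps memory bounded while the trainer makes its single pass over the roughly 10B tokens in each sample; only the subtoken counts, not the corpora, are held in RAM.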


Datasets used to train model-garden-lms/fineweb-lms-vocab-64000: HuggingFaceFW/fineweb (sample-10BT) and HuggingFaceFW/fineweb-edu (sample-10BT).