---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-edu
language:
- en
---

# TensorFlow Model Garden LMs: FineWeb WordPiece Tokenizer

This WordPiece tokenizer was trained as part of the TensorFlow Model Garden LMs project.

The tokenizer was trained on the `sample-10BT` subsets of the FineWeb and FineWeb-Edu datasets, using a vocabulary size of 64,000 subtokens.
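As a rough illustration of the general approach (not the project's actual training script, which is linked below), training a WordPiece vocabulary of this size with the Hugging Face `tokenizers` library could look like the following sketch; the input file name and casing setting are placeholder assumptions:

```python
from tokenizers import BertWordPieceTokenizer

# Illustrative sketch only -- see the linked script for the real pipeline.
# "fineweb_sample.txt" is a placeholder for the plain-text training corpus,
# and lowercase=False is an assumption about the tokenizer's casing.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["fineweb_sample.txt"],
    vocab_size=64_000,
)

# Writes vocab.txt to the given directory.
tokenizer.save_model(".")
```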

The script used to train this tokenizer can be found [here](https://github.com/stefan-it/model-garden-lms/blob/main/bert/train_vocab.py).
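
The trained tokenizer can be loaded with the `transformers` library; the repository id below is a placeholder and should be replaced with this repository's actual id on the Hub:

```python
from transformers import AutoTokenizer

# Placeholder repo id -- substitute the actual Hub id of this repository.
tokenizer = AutoTokenizer.from_pretrained("model-garden-lms/fineweb-wordpiece-tokenizer")

# Tokenize a sample sentence into WordPiece subtokens.
print(tokenizer.tokenize("FineWeb is a large-scale web corpus."))
```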