Tokenisers
Collection
A collection of tokenisers I have trained (so you don't have to).
2 items
BPE tokeniser with vocabulary size 32768, trained on the first 3 million examples in SlimPajama-627B.
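To make the two numbers above concrete, here is a minimal pure-Python sketch of what "BPE with a target vocabulary size, trained on the first N examples" means. It is a toy illustration only, not the SentencePiece implementation used for this tokeniser: the function name `train_bpe` and its parameters are hypothetical, the word splitting is plain whitespace, and no preprocessor or boundary marker is applied.

```python
from collections import Counter
from itertools import islice


def train_bpe(examples, vocab_size, max_examples):
    """Toy BPE trainer (hypothetical helper, not the SentencePiece trainer).

    Learns merges until the vocabulary reaches vocab_size or no pairs remain.
    """
    # Only consume the first max_examples items of the (possibly huge) stream.
    word_counts = Counter()
    for text in islice(examples, max_examples):
        word_counts.update(text.split())

    # Represent each word as a tuple of symbols, starting from single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    vocab = {symbol for word in words for symbol in word}
    merges = []

    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, count in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # Every word is a single symbol; nothing left to merge.

        # The most frequent pair becomes a new vocabulary entry.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.add(best[0] + best[1])

        # Apply the merge to every word, scanning left to right.
        merged_words = {}
        for word, count in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + count
        words = merged_words

    return vocab, merges
```

With the settings described above, the call would look like `train_bpe(stream, vocab_size=32768, max_examples=3_000_000)`, where `stream` is any iterator over text examples. A real trainer differs in many details (byte or character fallback, preprocessing, efficiency), so treat this purely as an explanation of the two parameters.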
BPE trainer implementation: SentencePieceTrainer, .BPEVocabulariser
Preprocessor: SentencePiecePreprocessor, ModernEnglishPreprocessor, Ġ
Time: 3h10m
Memory: 33.42 GiB peak usage.
Data sizes: