
BPE-32k SlimPajama-3M

A BPE tokeniser with a vocabulary size of 32768, trained on the first 3 million examples of SlimPajama-627B.
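
For reference, streaming those first 3 million examples from the Hugging Face Hub could look like the sketch below. This is an illustration under assumptions, not the original training script: it assumes the `cerebras/SlimPajama-627B` dataset repo and its `text` field.

```python
from itertools import islice
from datasets import load_dataset

# Stream SlimPajama-627B so the full corpus never has to fit on disk.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

# Take only the first 3 million examples, as was done for this tokeniser.
first_3m = (example["text"] for example in islice(stream, 3_000_000))
```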

Tokeniser details

BPE trainer implementation:

Preprocessor (a rough equivalent is sketched after this list):

  • During training: TkTkT's SentencePiecePreprocessor
  • During inference: TkTkT's ModernEnglishPreprocessor
    1. NFKC normalisation
    2. Punctuation splitter, whitespace splitter, English contraction splitter
    3. GPT-2's pseudo-byte mapping
    4. Start-of-word marker Ġ
    5. Digit and hyphen isolation
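
The inference-time pipeline above can be approximated with Hugging Face `tokenizers` components, as in the following sketch. This is an assumed analogue, not TkTkT's actual implementation: the English contraction splitter and hyphen isolation have no exact off-the-shelf counterpart here, and whitespace splitting is handled implicitly by the byte-level step.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tok = Tokenizer(models.BPE())
tok.normalizer = normalizers.NFKC()                     # step 1: NFKC normalisation
tok.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Punctuation(),                       # step 2: punctuation splitting
    pre_tokenizers.Digits(individual_digits=True),      # step 5: digit isolation
    pre_tokenizers.ByteLevel(add_prefix_space=True),    # steps 3-4: GPT-2 pseudo-bytes and the Ġ start-of-word marker
])

# A trainer matching the 32768-entry vocabulary; the pseudo-byte alphabet is
# seeded explicitly so that every byte remains representable.
trainer = trainers.BpeTrainer(vocab_size=32768,
                              initial_alphabet=pre_tokenizers.ByteLevel.alphabet())
```

With the stream from the earlier sketch, `tok.train_from_iterator(first_3m, trainer=trainer)` would then run the whole pipeline end to end.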

Training details

Time: 3h10m

  • Preprocessing and counting the 3M corpus: 2h45m
  • BPE merges: 25m (a textbook version of this merge loop is sketched below)

Memory: 33.42 GiB peak usage.
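
The 25-minute merge phase is the classic BPE loop over the word counts gathered during preprocessing: repeatedly find the most frequent adjacent symbol pair and fuse it, each merge adding one vocabulary entry on top of the initial alphabet. A minimal textbook version, not TkTkT's optimised implementation, looks like this:

```python
from collections import Counter

def best_pair(word_counts: dict[tuple[str, ...], int]) -> tuple[str, str]:
    """Return the most frequent adjacent symbol pair across all words."""
    pairs = Counter()
    for word, freq in word_counts.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge(word_counts, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    a, b = pair
    out = {}
    for word, freq in word_counts.items():
        new, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                new.append(a + b)
                i += 2
            else:
                new.append(word[i])
                i += 1
        out[tuple(new)] = out.get(tuple(new), 0) + freq
    return out

def learn_merges(word_counts, n_merges):
    """Run the merge loop until the vocabulary budget is exhausted."""
    merges = []
    for _ in range(n_merges):
        pair = best_pair(word_counts)
        merges.append(pair)
        word_counts = merge(word_counts, pair)
    return merges
```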

Data sizes:

  • Examples considered: 3 000 000
  • Examples used: 2 609 893 (390 107 examples dropped for being > 8192 characters)
  • Characters counted: 6 685 212 190
  • Unique words after whitespace splitting: 9 254 839
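
These statistics could be recomputed over the stream with a loop like the one below; the > 8192-character cut-off is taken from the note above, and `first_3m` refers to the earlier streaming sketch.

```python
n_considered = n_used = n_chars = 0
unique_words = set()

for text in first_3m:                # iterator from the streaming sketch above
    n_considered += 1
    if len(text) > 8192:             # drop overlong examples, per the note above
        continue
    n_used += 1
    n_chars += len(text)
    unique_words.update(text.split())  # whitespace splitting

print(n_considered, n_used, n_chars, len(unique_words))
```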