# BPE-32k SlimPajama-3M

BPE tokeniser with a vocabulary size of 32768, trained on the first 3 million examples in [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B).

## Tokeniser details
BPE trainer implementation:
- Back-end: [SentencePiece](https://github.com/google/sentencepiece)'s `SentencePieceTrainer`.
- Front-end: [TkTkT](https://github.com/bauwenst/TkTkT)'s [`BPEVocabulariser`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/models/bpe/vocabularisation.py#L210).
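
For a sense of what the back-end call looks like, a comparable vocabulary can be trained with SentencePiece's Python API roughly as sketched below. This is a minimal illustration, not the exact invocation made by `BPEVocabulariser`; the input path and everything beyond `model_type` and `vocab_size` are assumptions.

```python
# Minimal sketch: training a 32k BPE model with SentencePiece's Python API.
# "corpus.txt" is a hypothetical file of preprocessed text, one example per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",     # hypothetical path to the preprocessed corpus
    model_prefix="bpe32k",  # writes bpe32k.model and bpe32k.vocab
    model_type="bpe",       # BPE merges instead of the default unigram model
    vocab_size=32768,       # the 32k vocabulary described above
)
```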

Preprocessor:
- During training: TkTkT's [`SentencePiecePreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L181)
- During inference: TkTkT's [`ModernEnglishPreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L105), consisting of:
  - NFKC normalisation
  - Punctuation splitter, whitespace splitter, English contraction splitter
  - GPT-2's pseudo-byte mapping
  - Start-of-word marker `Ġ`
  - Digit and hyphen isolation
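
The pseudo-byte mapping and the start-of-word marker are related: GPT-2's byte-to-unicode table gives every byte a printable stand-in character, and the space byte happens to map to `Ġ`. The sketch below reconstructs that standard mapping for illustration; it follows the published GPT-2 recipe and is not copied from TkTkT's implementation.

```python
# GPT-2's pseudo-byte mapping: each byte 0..255 gets a printable stand-in
# character so that arbitrary bytes can be handled as ordinary text.
def bytes_to_unicode() -> dict[int, str]:
    # Bytes that already render as printable characters keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte is shifted into the 256+ codepoint range.
    n = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + n)
            n += 1
    return mapping

# The space byte (0x20) maps to 'Ġ' (U+0120), which is why `Ġ` doubles as
# the start-of-word marker after whitespace splitting.
assert bytes_to_unicode()[ord(" ")] == "Ġ"
```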

## Training details
**Time:** 3h10m
- Preprocessing and counting the 3M corpus: 2h45m
- BPE merges: 25m

**Memory:** 33.42 GiB peak usage.

**Data sizes:**
- Examples considered: 3 000 000
- Examples used: 2 609 893 (390 107 examples dropped for being > 8192 characters)
- Characters counted: 6 685 212 190
- Unique words after whitespace splitting: 9 254 839
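
For context on how those counts relate, the selection step corresponds to something like the sketch below (an assumed reconstruction using 🤗 `datasets` streaming; the actual ingestion code is part of the training pipeline and may differ):

```python
# Sketch of the corpus selection described above: stream the first 3 million
# SlimPajama examples and drop any example longer than 8192 characters.
from itertools import islice
from datasets import load_dataset

stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

kept = 0
for example in islice(stream, 3_000_000):  # examples considered: 3 000 000
    if len(example["text"]) > 8192:        # dropped: 390 107 over-long examples
        continue
    kept += 1                              # examples used: 2 609 893

print(kept)
```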