# BPE-32k SlimPajama-3M

BPE tokeniser with a vocabulary size of 32768, trained on the first 3 million examples in [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B).

## Tokeniser details
BPE trainer implementation:
- Back-end: [SentencePiece](https://github.com/google/sentencepiece)'s `SentencePieceTrainer`.
- Front-end: [TkTkT](https://github.com/bauwenst/TkTkT)'s [`BPEVocabulariser`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/models/bpe/vocabularisation.py#L210).
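
For a sense of what the back-end call looks like, a comparable vocabulary can be trained with SentencePiece's Python API roughly as sketched below. This is a minimal illustration, not the exact invocation made by `BPEVocabulariser`; the input path and everything beyond `model_type` and `vocab_size` are assumptions.

```python
# Minimal sketch: training a 32k BPE model with SentencePiece's Python API.
# "corpus.txt" is a hypothetical file of preprocessed text, one example per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",     # hypothetical path to the preprocessed corpus
    model_prefix="bpe32k",  # writes bpe32k.model and bpe32k.vocab
    model_type="bpe",       # BPE merges instead of the default unigram model
    vocab_size=32768,       # the 32k vocabulary described above
)
```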

Preprocessor:
- During training: TkTkT's [`SentencePiecePreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L181)
- During inference: TkTkT's [`ModernEnglishPreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L105), consisting of:
  - NFKC normalisation
  - Punctuation splitter, whitespace splitter, English contraction splitter
  - GPT-2's pseudo-byte mapping
  - Start-of-word marker `Ġ`
  - Digit and hyphen isolation
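
The pseudo-byte mapping and the start-of-word marker are related: GPT-2's byte-to-unicode table gives every byte a printable stand-in character, and the space byte happens to map to `Ġ`. The sketch below reconstructs that standard mapping for illustration; it follows the published GPT-2 recipe and is not copied from TkTkT's implementation.

```python
# GPT-2's pseudo-byte mapping: each byte 0..255 gets a printable stand-in
# character so that arbitrary bytes can be handled as ordinary text.
def bytes_to_unicode() -> dict[int, str]:
    # Bytes that already render as printable characters keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte is shifted into the 256+ codepoint range.
    n = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + n)
            n += 1
    return mapping

# The space byte (0x20) maps to 'Ġ' (U+0120), which is why `Ġ` doubles as
# the start-of-word marker after whitespace splitting.
assert bytes_to_unicode()[ord(" ")] == "Ġ"
```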

## Training details
**Time:** 3h10m
- Preprocessing and counting the 3M corpus: 2h45m
- BPE merges: 25m

**Memory:** 33.42 GiB peak usage.

**Data sizes:**
- Examples considered: 3 000 000
- Examples used: 2 609 893 (390 107 examples dropped for being > 8192 characters)
- Characters counted: 6 685 212 190
- Unique words after whitespace splitting: 9 254 839
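
For context on how those counts relate, the selection step corresponds to something like the sketch below (an assumed reconstruction using 🤗 `datasets` streaming; the actual ingestion code is part of the training pipeline and may differ):

```python
# Sketch of the corpus selection described above: stream the first 3 million
# SlimPajama examples and drop any example longer than 8192 characters.
from itertools import islice
from datasets import load_dataset

stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

kept = 0
for example in islice(stream, 3_000_000):  # examples considered: 3 000 000
    if len(example["text"]) > 8192:        # dropped: 390 107 over-long examples
        continue
    kept += 1                              # examples used: 2 609 893

print(kept)
```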