Bauwens committed on
Commit
aa69eb1
1 Parent(s): cfb1893

Create README.md

Files changed (1)
  1. README.md +29 -0
README.md ADDED
@@ -0,0 +1,29 @@
+ # BPE-32k SlimPajama-3M
+ BPE tokeniser with vocabulary size 32768, trained on the first 3 million examples in [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
+
+ ## Tokeniser details
+ BPE trainer implementation:
+ - Back-end: [SentencePiece](https://github.com/google/sentencepiece)'s `SentencePieceTrainer`.
+ - Front-end: [TkTkT](https://github.com/bauwenst/TkTkT)'s [`BPEVocabulariser`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/models/bpe/vocabularisation.py#L210).
+
+ Preprocessor:
+ - During training: TkTkT's [`SentencePiecePreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L181)
+ - During inference: TkTkT's [`ModernEnglishPreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L105)
+ - NFKC normalisation
+ - Punctuation splitter, whitespace splitter, English contraction splitter
+ - GPT-2's pseudo-byte mapping
+ - Start-of-word marker `Ġ`
+ - Digit and hyphen isolation
+
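To make a few of the preprocessing steps listed above concrete, here is a minimal Python sketch of NFKC normalisation, whitespace splitting, GPT-2's pseudo-byte mapping and the `Ġ` start-of-word marker. It is an illustration only and not TkTkT's implementation: the function names `gpt2_byte_to_char` and `pretokenise_sketch` are hypothetical, the punctuation, contraction, digit and hyphen steps are left out, and the marker is produced here by mapping a prepended space byte, which is an assumption about how the steps combine.

```python
import unicodedata


def gpt2_byte_to_char() -> dict[int, str]:
    """Map each of the 256 byte values to a printable character, as GPT-2 does."""
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, ...) are shifted
            # to code points starting at 256.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = gpt2_byte_to_char()


def pretokenise_sketch(text: str) -> list[str]:
    """Hypothetical, simplified pipeline: NFKC, whitespace split, pseudo-bytes."""
    text = unicodedata.normalize("NFKC", text)
    pretokens = []
    for word in text.split():
        # Assumption: a space is prepended to every word, so the mapped
        # space byte acts as the start-of-word marker.
        raw = (" " + word).encode("utf-8")
        pretokens.append("".join(BYTE_TO_CHAR[b] for b in raw))
    return pretokens


if __name__ == "__main__":
    print(pretokenise_sketch("ﬁne café"))
```

In this mapping the space byte (0x20) lands on `Ġ`, which is why that character shows up as the start-of-word marker. Non-ASCII characters expand to one pseudo-byte character per UTF-8 byte.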
+ ## Training details
+ **Time:** 3h10m
+ - Preprocessing and counting the 3M corpus: 2h45m
+ - BPE merges: 25m
+
+ **Memory:** 33.42 GiB peak usage.
+
+ **Data sizes:**
+ - Examples considered: 3 000 000
+ - Examples used: 2 609 893 (390 107 examples dropped for being > 8192 characters).
+ - Characters counted: 6 685 212 190
+ - Unique words after whitespace splitting: 9 254 839
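As a quick sanity check, the figures in the lists above can be cross-checked with a few lines of Python. The variable names are illustrative, and the mean length per example rests on the assumption that the counted characters come only from the examples that were kept, which the lists do not state explicitly.

```python
# Figures copied from the lists above.
examples_considered = 3_000_000
examples_dropped = 390_107
examples_used = 2_609_893
characters_counted = 6_685_212_190

# The used examples should be the considered ones minus the dropped ones.
assert examples_considered - examples_dropped == examples_used

# Share of examples dropped for exceeding 8192 characters.
dropped_share = examples_dropped / examples_considered

# Assumption: characters were counted over the used examples only.
mean_chars_per_used_example = characters_counted / examples_used

# The two timing phases should add up to the reported total of 3h10m.
total_minutes = (2 * 60 + 45) + 25
assert divmod(total_minutes, 60) == (3, 10)

if __name__ == "__main__":
    print(f"dropped share: {dropped_share:.2%}")
    print(f"mean characters per used example: {mean_chars_per_used_example:.0f}")
```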