Commit 9ede907, committed by Bauwens
1 Parent(s): aa69eb1

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -9,13 +9,13 @@ BPE trainer implementation:
  Preprocessor:
  - During training: TkTkT's [`SentencePiecePreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L181)
  - During inference: TkTkT's [`ModernEnglishPreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L105)
- - NFKC normalisation
- - Punctuation splitter, whitespace splitter, English contraction splitter
- - GPT-2's pseudo-byte mapping
- - Start-of-word marker `Ġ`
- - Digit and hyphen isolation
+ 1. NFKC normalisation
+ 2. Punctuation splitter, whitespace splitter, English contraction splitter
+ 3. GPT-2's pseudo-byte mapping
+ 4. Start-of-word marker `Ġ`
+ 5. Digit and hyphen isolation
 
- ## Training details:
+ ## Training details
  **Time:** 3h10m
  - Preprocessing and counting the 3M corpus: 2h45m
  - BPE merges: 25m
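
The five inference-time preprocessing steps that this commit renumbers can be illustrated with a small standard-library sketch. This is not TkTkT's `ModernEnglishPreprocessor`: the function names (`gpt2_byte_map`, `preprocess`), the regular expressions, and the decision to split the `Ġ` marker away from a leading digit are all assumptions made for illustration. Only the byte-to-character table follows the well-known GPT-2 scheme, in which the space byte maps to `Ġ`.

```python
import re
import unicodedata


def gpt2_byte_map() -> dict:
    """Map each of the 256 byte values to a printable character (GPT-2 scheme)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable bytes are shifted to code points 256 and up.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_MAP = gpt2_byte_map()
START_OF_WORD = BYTE_MAP[ord(" ")]  # the pseudo-byte for a space, i.e. "Ġ"

# Hypothetical splitter: an English contraction suffix, a run of letters,
# a run of digits, or any other single non-space character (punctuation).
SPLIT = re.compile(r"'(?:s|t|re|ve|m|ll|d)(?![^\W\d_])|[^\W\d_]+|\d+|\S", re.IGNORECASE)

# Hypothetical isolator: every ASCII digit and hyphen becomes its own piece.
ISOLATE = re.compile(r"[0-9]|-|[^0-9-]+")


def preprocess(text: str) -> list:
    text = unicodedata.normalize("NFKC", text)  # step 1: NFKC normalisation
    pretokens = []
    for word in text.split():  # step 2: whitespace splitter
        # Step 2 (continued): punctuation and contraction splitting.
        for i, piece in enumerate(SPLIT.findall(word)):
            # Step 3: replace each UTF-8 byte with its pseudo-byte character.
            piece = "".join(BYTE_MAP[b] for b in piece.encode("utf-8"))
            if i == 0:
                piece = START_OF_WORD + piece  # step 4: start-of-word marker
            pretokens.extend(ISOLATE.findall(piece))  # step 5: digit/hyphen isolation
    return pretokens
```

Under these assumptions, `preprocess("It's \ufb01ne: 24-7 caf\u00e9")` returns `['ĠIt', "'s", 'Ġfine', ':', 'Ġ', '2', '4', '-', '7', 'ĠcafÃ©']`: the `ﬁ` ligature is expanded by NFKC, and `é` appears as the two pseudo-byte characters `Ã©`.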