Update README.md
README.md
CHANGED
@@ -9,13 +9,13 @@ BPE trainer implementation:
 Preprocessor:
 - During training: TkTkT's [`SentencePiecePreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L181)
 - During inference: TkTkT's [`ModernEnglishPreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L105)
-
-
-
-
-
+1. NFKC normalisation
+2. Punctuation splitter, whitespace splitter, English contraction splitter
+3. GPT-2's pseudo-byte mapping
+4. Start-of-word marker `Ġ`
+5. Digit and hyphen isolation
 
-## Training details
+## Training details
 **Time:** 3h10m
 - Preprocessing and counting the 3M corpus: 2h45m
 - BPE merges: 25m
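For readers unfamiliar with the five inference-time steps this change adds to the README, here is a rough, standard-library-only sketch of what such a pipeline could look like. It is not TkTkT's `ModernEnglishPreprocessor`: the names `gpt2_byte_table`, `preprocess`, `PIECE` and `ISOLATE` are hypothetical, the regular expressions are simplifications, and the choice to attach `Ġ` only to the first piece of each whitespace-delimited word is an assumption. Only the byte table follows GPT-2's published mapping.

```python
import re
import unicodedata


def gpt2_byte_table():
    """Build GPT-2's byte-to-printable-character table (256 entries)."""
    # Bytes that already render as visible characters keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\u00a1"), ord("\u00ac") + 1))
        + list(range(ord("\u00ae"), ord("\u00ff") + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every other byte is shifted to the first unused code points from U+0100 on.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = gpt2_byte_table()
SOW = "\u0120"  # the start-of-word marker "Ġ"

# Hypothetical, simplified splitter: contraction suffix | word (hyphens kept) | punctuation.
PIECE = re.compile(r"'(?:s|t|re|ve|m|ll|d)(?!\w)|[\w\-]+|[^\w\s]")
# Hypothetical isolation rule: every digit and every hyphen becomes its own pretoken.
ISOLATE = re.compile(r"[0-9]|-|[^0-9\-]+")


def preprocess(text):
    # 1. NFKC normalisation
    text = unicodedata.normalize("NFKC", text)
    pretokens = []
    # 2. Whitespace splitter, then contraction and punctuation splitter per word
    for word in text.split():
        for i, piece in enumerate(PIECE.findall(word)):
            # 3. Pseudo-byte mapping of the UTF-8 bytes
            mapped = "".join(BYTE_TO_CHAR[b] for b in piece.encode("utf-8"))
            # 4. Start-of-word marker (assumption: first piece of each word only)
            if i == 0:
                mapped = SOW + mapped
            # 5. Digit and hyphen isolation
            pretokens.extend(ISOLATE.findall(mapped))
    return pretokens


print(preprocess("It's a state-of-the-art 3M corpus"))
```

Under these assumptions, digits and hyphens always end up as single-character pretokens, so a merge learned later can never span them. The marker may also end up alone (as in `Ġ`, `3`, `M`) when a word starts with a digit. Whether the real preprocessor behaves the same way in that case is not stated in the README.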