Update README.md
README.md
CHANGED
@@ -8,5 +8,18 @@ language:
 tags:
 - tokenizer
 - wordlevel
+- tokenizers
+- wikitext
 inference: false
----
+---
+
+# WikiText-WordLevel
+
+This is a simple word-level tokenizer created using the [Tokenizers](https://github.com/huggingface/tokenizers) library. It was trained for educational purposes on the combined train, validation, and test splits of the [WikiText-103](https://huggingface.co/datasets/wikitext) corpus.
+
+- Tokenizer Type: Word-Level
+- Vocabulary Size: 75K
+- Special Tokens: `<s>` (start of sequence), `</s>` (end of sequence), `<unk>` (unknown token)
+- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
+- Pre-tokenization: Whitespace
+- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py)
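For reference, the configuration listed in the added README (word-level model, NFC/Strip/Lowercase normalization, whitespace pre-tokenization, 75K vocabulary, `<s>`/`</s>`/`<unk>` special tokens) could be reproduced with the Tokenizers library roughly as sketched below. This is an illustrative sketch, not the contents of wikitext-wordlevel.py; the training file name is a placeholder.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.normalizers import Sequence, NFC, Strip, Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Word-level model; out-of-vocabulary words map to <unk>
tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))

# Normalization chain from the README: NFC, Strip, Lowercase
tokenizer.normalizer = Sequence([NFC(), Strip(), Lowercase()])

# Pre-tokenization on whitespace / word boundaries
tokenizer.pre_tokenizer = Whitespace()

# Trainer with the vocabulary size and special tokens listed above
trainer = WordLevelTrainer(
    vocab_size=75_000,
    special_tokens=["<s>", "</s>", "<unk>"],
)

# "wikitext-combined.txt" is a placeholder for the concatenated
# train/validation/test splits of WikiText-103
tokenizer.train(files=["wikitext-combined.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```

Once saved, the tokenizer can be reloaded with `Tokenizer.from_file("tokenizer.json")` and used via `tokenizer.encode(...)`.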