dustalov committed · verified
Commit ef99e22 · Parent(s): 9cc9058

Update README.md

Files changed (1): README.md (+14 −1)
README.md CHANGED
@@ -8,5 +8,18 @@ language:
 tags:
 - tokenizer
 - wordlevel
+- tokenizers
+- wikitext
 inference: false
----
+---
+
+# WikiText-WordLevel
+
+This is a simple word-level tokenizer created using the [Tokenizers](https://github.com/huggingface/tokenizers) library. It was trained for educational purposes on the combined train, validation, and test splits of the [WikiText-103](https://huggingface.co/datasets/wikitext) corpus.
+
+- Tokenizer Type: Word-Level
+- Vocabulary Size: 75K
+- Special Tokens: `<s>` (start of sequence), `</s>` (end of sequence), `<unk>` (unknown token)
+- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
+- Pre-tokenization: Whitespace
+- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py)
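The configuration the card describes (WordLevel model, NFC → Strip → Lowercase normalization, whitespace pre-tokenization, 75K vocabulary, `<s>`/`</s>`/`<unk>` specials) can be sketched with the Tokenizers library roughly as follows. This is a minimal illustration, not the actual training script (`wikitext-wordlevel.py`); the tiny in-memory corpus stands in for WikiText-103:

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFC, Strip, Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Word-level model; out-of-vocabulary words map to <unk>.
tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))

# Normalization as listed in the card: NFC, then strip, then lowercase.
tokenizer.normalizer = normalizers.Sequence([NFC(), Strip(), Lowercase()])

# Pre-tokenization: split on whitespace (and punctuation boundaries).
tokenizer.pre_tokenizer = Whitespace()

# 75K vocabulary with the three special tokens from the card.
trainer = WordLevelTrainer(
    vocab_size=75_000,
    special_tokens=["<s>", "</s>", "<unk>"],
)

# Stand-in corpus; the real tokenizer was trained on the combined
# WikiText-103 train/validation/test splits.
corpus = ["The quick brown fox", "jumps over the lazy dog"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("The LAZY Fox").tokens)
```

Because normalization lowercases before lookup, `"The LAZY Fox"` encodes to the lowercase word-level tokens, and any word outside the trained vocabulary comes back as `<unk>`.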