Update README.md
README.md
CHANGED
@@ -8,5 +8,18 @@ language:
 tags:
 - tokenizer
 - wordlevel
+- tokenizers
+- wikitext
 inference: false
----
+---
+
+# WikiText-WordLevel
+
+This is a simple word-level tokenizer created using the [Tokenizers](https://github.com/huggingface/tokenizers) library. It was trained for educational purposes on the combined train, validation, and test splits of the [WikiText-103](https://huggingface.co/datasets/wikitext) corpus.
+
+- Tokenizer Type: Word-Level
+- Vocabulary Size: 75K
+- Special Tokens: `<s>` (start of sequence), `</s>` (end of sequence), `<unk>` (unknown token)
+- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
+- Pre-tokenization: Whitespace
+- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py)
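For reference, the configuration listed in the added README (word-level model, NFC/Strip/Lowercase normalization, whitespace pre-tokenization, 75K vocabulary, `<s>`/`</s>`/`<unk>` special tokens) could be reproduced with the Tokenizers library roughly as sketched below. This is an illustrative sketch, not the contents of wikitext-wordlevel.py; the training file name is a placeholder.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.normalizers import Sequence, NFC, Strip, Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Word-level model; out-of-vocabulary words map to <unk>
tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))

# Normalization chain from the README: NFC, Strip, Lowercase
tokenizer.normalizer = Sequence([NFC(), Strip(), Lowercase()])

# Pre-tokenization on whitespace / word boundaries
tokenizer.pre_tokenizer = Whitespace()

# Trainer with the vocabulary size and special tokens listed above
trainer = WordLevelTrainer(
    vocab_size=75_000,
    special_tokens=["<s>", "</s>", "<unk>"],
)

# "wikitext-combined.txt" is a placeholder for the concatenated
# train/validation/test splits of WikiText-103
tokenizer.train(files=["wikitext-combined.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```

Once saved, the tokenizer can be reloaded with `Tokenizer.from_file("tokenizer.json")` and used via `tokenizer.encode(...)`.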