```
padded vocab (size: 54634) with 22 dummy tokens (new size: 54656)
Vocab size: 54634
```

Training data
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox_japanese/tokenization_gpt_neox_japanese.py
## 20B | |
[configs/20B.yml](https://github.com/EleutherAI/gpt-neox/blob/main/configs/20B.yml#L7)
```
"vocab-file": "./20B_checkpoints/20B_tokenizer.json",
```
Vocab size: 50277
self.padded_vocab_size = 50304
padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)
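The jump from 50277 to 50304 is Megatron-style vocab padding: the vocabulary is rounded up to a multiple of `make-vocab-size-divisible-by` (128 by default) times the tensor-parallel size, so the embedding matrix shards evenly across GPUs. A minimal sketch of the rounding rule (the function name and defaults here are assumptions for illustration, not the exact gpt-neox API):

```python
def pad_vocab_size(vocab_size: int, divisible_by: int = 128,
                   model_parallel_size: int = 1) -> int:
    """Round vocab_size up to a multiple of divisible_by * model_parallel_size."""
    multiple = divisible_by * model_parallel_size
    while vocab_size % multiple != 0:
        vocab_size += 1  # each increment is one "dummy token"
    return vocab_size

print(pad_vocab_size(50277))  # 50277 + 27 dummy tokens -> 50304
print(pad_vocab_size(54634))  # 54634 + 22 dummy tokens -> 54656
```

Both log lines above are consistent with this rule: 50304 = 393 × 128 and 54656 = 427 × 128.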
## Vocabulary
See convert_vocab_to_txt.py
``` | |
{"id": 13609, "token": "\u00e4\u00b8\u0143", "token_decode": "\u4e2d"} 中 | |
# several symbols merged into a single token
{"id": 13663, "token": ".*]{}", "token_decode": ".*]{}"} .*]{}
``` | |
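The `token` field above is the GPT-2-style byte-to-unicode rendering of the UTF-8 bytes, which is why `"\u00e4\u00b8\u0143"` decodes back to 中. A sketch of the standard `bytes_to_unicode` mapping and its inverse (mirroring the helper used in the HF `transformers` GPT-2/GPT-NeoX tokenizers):

```python
def bytes_to_unicode():
    """Byte-level BPE maps every byte 0-255 to a printable unicode character."""
    # printable byte ranges are kept as-is
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:  # control bytes etc. are shifted to 256+
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# invert the mapping to go from a vocab entry back to raw bytes
byte_decoder = {c: b for b, c in bytes_to_unicode().items()}

token = "\u00e4\u00b8\u0143"  # vocab entry for id 13609
raw = bytes(byte_decoder[ch] for ch in token)
print(raw.decode("utf-8"))  # -> 中
```

The UTF-8 bytes of 中 are `E4 B8 AD`; `E4` and `B8` are printable and stay as `ä` and `¸`, while `AD` (a control byte) is shifted into the 256+ range and becomes `Ń` (U+0143), exactly the three characters stored in the vocab.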
## Chinese support
Essentially no OOV.
gpt-neox was trained on ~800GB of (mostly English) data, so why does the vocabulary cover Chinese? Because the tokenizer is byte-level BPE: any character can always fall back to its UTF-8 bytes.
``` | |
丁 [3218, 212]
七 [3218, 214]
万 [3218, 218]
诀 [11894, 211]
证 [11894, 212]
``` | |
Encoded-length statistics: Counter({2: 4190, 3: 1295, 1: 285})
Average encoded length: 2.1750433275563257
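The average is just the count-weighted mean of that Counter, which is easy to verify:

```python
from collections import Counter

# tokens-per-character distribution from the statistics above
lengths = Counter({2: 4190, 3: 1295, 1: 285})

total_chars = sum(lengths.values())                              # 5770 characters
total_tokens = sum(n * count for n, count in lengths.items())    # 12550 tokens
print(total_tokens / total_chars)  # -> 2.1750433275563257
```

So most Chinese characters cost 2 tokens each (their 3 UTF-8 bytes merge into two byte-pair tokens), a minority cost 3, and only 285 made it into the vocab as a single token.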