## GPT-2 vs GPT-NeoX-20B
![GPT-2 tokenization vs GPT-NeoX-20B tokenization. GPT-NeoX-20B handles whitespace better, which is especially useful for text such as source code.](/images/gptNeoX20B-VS-gpt2.jpg)
## 20B
[configs/20B.yml](https://github.com/EleutherAI/gpt-neox/blob/main/configs/20B.yml#L7)
```
"vocab-file": "./20B_checkpoints/20B_tokenizer.json",
```
Vocab size: 50277
`self.padded_vocab_size = 50304`
Training log: `padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)`
Padding rounds 50277 up to the nearest multiple of 128 (50304 = 128 × 393), hence the 27 dummy tokens.
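A minimal sketch of this rounding rule, assuming the multiple is 128 (Megatron-style code also scales the multiple by the model-parallel size):
```python
def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    # Round up to the nearest multiple so embedding shards split evenly.
    while vocab_size % multiple != 0:
        vocab_size += 1
    return vocab_size

print(pad_vocab_size(50277))  # 50304 -> 27 dummy tokens
```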
## Vocabulary
See convert_vocab_to_txt.py
```sh
{"id": 13609, "token": "\u00e4\u00b8\u0143", "token_decode": "\u4e2d"} 中
# several symbols merged into one token
{"id": 13663, "token": ".*]{}", "token_decode": ".*]{}"} .*]{}
# base bytes
(\u0021-\u007E) + (\u00A1-\u0143)
```
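The base-byte ranges above come from GPT-2's byte-to-unicode table, which byte-level BPE (including GPT-NeoX's tokenizer) uses to map every byte to a printable character; this is the standard `bytes_to_unicode` function from the GPT-2 encoder:
```python
def bytes_to_unicode():
    # Printable bytes map to themselves; the rest are shifted to 256+n,
    # giving the ranges (\u0021-\u007E) and (\u00A1-\u0143) seen above.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

b2u = bytes_to_unicode()
# 中 is UTF-8 bytes e4 b8 ad -> "ä¸Ń", matching token id 13609 above.
print("".join(b2u[b] for b in "中".encode("utf-8")))
```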
## special_tokens
https://huggingface.co/EleutherAI/gpt-neox-20b/blob/main/special_tokens_map.json
```
{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>"}
```
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
```
unk_token="<|endoftext|>",
bos_token="<|endoftext|>",
eos_token="<|endoftext|>",
```
## Chinese support
Essentially no OOV.
GPT-NeoX was trained on an 800 GB English dataset, so why does its vocabulary support Chinese? Because it uses byte-level BPE: any character can be encoded as a sequence of bytes, so nothing falls outside the vocabulary.
```
丁 [3218, 212]
七 [3218, 214]
万 [3218, 218]
诀 [11894, 211]
证 [11894, 212]
```
Encoded-length distribution: Counter({2: 4190, 3: 1295, 1: 285})
Average encoded length: 2.1750433275563257
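A sketch to reproduce these statistics, assuming a file common_chars.txt (hypothetical) with one common Chinese character per line:
```python
from collections import Counter

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
chars = [ln.strip() for ln in open("common_chars.txt", encoding="utf-8") if ln.strip()]
lengths = Counter(len(tok.encode(ch)) for ch in chars)
print(lengths)  # e.g. Counter({2: 4190, 3: 1295, 1: 285})
print(sum(k * v for k, v in lengths.items()) / sum(lengths.values()))  # ~2.175
```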
## Completeness
## build tokenizer
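No details are given here; as one possible route, a GPT-2/NeoX-style byte-level BPE tokenizer can be trained with the HuggingFace tokenizers library (the corpus path and output path below are assumptions; the vocab size is the 20B figure from above):
```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],  # hypothetical training corpus
    vocab_size=50277,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save("my_tokenizer.json")
```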
## merge
`"ard less",` is an example entry from the merges list in the tokenizer JSON: the pair `ard` + `less` merges into the single token `ardless`.
## HF format
https://huggingface.co/EleutherAI/gpt-neox-20b/tree/main