## GPT-NeoX-20B vs. GPT-2 tokenization
![GPT-2 tokenization vs. GPT-NeoX-20B tokenization. GPT-NeoX-20B handles whitespace better, which is especially useful for text such as source code.](/images/gptNeoX20B-VS-gpt2.jpg)
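A quick way to reproduce the comparison in the figure, assuming `transformers` is installed and both tokenizers can be downloaded:
```
from transformers import AutoTokenizer

gpt2 = AutoTokenizer.from_pretrained("gpt2")
neox = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

code = "def f():\n    return 1\n"
# Compare how each tokenizer splits the indentation whitespace.
print(len(gpt2.tokenize(code)), gpt2.tokenize(code))
print(len(neox.tokenize(code)), neox.tokenize(code))
```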
## 20B
[configs/20B.yml](https://github.com/EleutherAI/gpt-neox/blob/main/configs/20B.yml#L7)
```
"vocab-file": "./20B_checkpoints/20B_tokenizer.json",
```
Vocab size: 50277
self.padded_vocab_size = 50304
padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)
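The 27 dummy tokens come from the Megatron-style padding rule in gpt-neox: the vocabulary is grown until it is divisible by `make-vocab-size-divisible-by` times the model-parallel size. A minimal sketch of that rule, assuming `make-vocab-size-divisible-by: 128` and model-parallel size 1, which reproduces the 50304 above:
```
def vocab_size_with_padding(orig_vocab_size,
                            make_vocab_size_divisible_by=128,
                            model_parallel_size=1):
    # Grow the vocab until the embedding dimension is a multiple of
    # make_vocab_size_divisible_by * model_parallel_size.
    multiple = make_vocab_size_divisible_by * model_parallel_size
    after = orig_vocab_size
    while after % multiple != 0:
        after += 1
    return after

print(vocab_size_with_padding(50277))  # 50304, i.e. 27 dummy tokens
```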
## Vocabulary
See convert_vocab_to_txt.py
```
{"id": 13609, "token": "\u00e4\u00b8\u0143", "token_decode": "\u4e2d"} 中
# multiple symbols concatenated into a single token
{"id": 13663, "token": ".*]{}", "token_decode": ".*]{}"} .*]{}
```
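A sketch of how a dump in that format can be produced with the `tokenizers` library; the actual convert_vocab_to_txt.py may differ, and the tokenizer path is the one from the config above:
```
import json
from tokenizers import Tokenizer

tok = Tokenizer.from_file("./20B_checkpoints/20B_tokenizer.json")
vocab = tok.get_vocab()  # surface form -> id

with open("20B_vocab.txt", "w", encoding="utf-8") as f:
    for token, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        # "token" is the byte-level BPE surface form (raw bytes remapped to
        # printable unicode); decoding the id recovers the actual text.
        line = {"id": idx, "token": token, "token_decode": tok.decode([idx])}
        f.write(json.dumps(line) + "\n")
```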
## special_tokens
https://huggingface.co/EleutherAI/gpt-neox-20b/blob/main/special_tokens_map.json
```
{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>"}
```
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
```
unk_token="<|endoftext|>",
bos_token="<|endoftext|>",
eos_token="<|endoftext|>",
```
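A quick check with the Hugging Face tokenizer, assuming the files for `EleutherAI/gpt-neox-20b` can be downloaded:
```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# bos, eos and unk all map to the same token
print(tok.bos_token, tok.eos_token, tok.unk_token)  # <|endoftext|> three times
```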
## Chinese support
Essentially no OOV.
GPT-NeoX was trained on an ~800GB English dataset (The Pile), so why does the vocabulary cover Chinese? Because the tokenizer is byte-level BPE: any character can fall back to its raw UTF-8 bytes, so nothing is out of vocabulary.
```
丁 [3218, 212]
七 [3218, 214]
万 [3218, 218]
诀 [11894, 211]
证 [11894, 212]
```
Encoded-length statistics (tokens per character): Counter({2: 4190, 3: 1295, 1: 285})
Average encoded length: 2.1750433275563257
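A sketch of how these statistics can be computed; `common_hanzi.txt` is a placeholder name for whatever list of common Chinese characters was actually used:
```
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

with open("common_hanzi.txt", encoding="utf-8") as f:
    chars = [line.strip() for line in f if line.strip()]

lengths = [len(tok.encode(ch, add_special_tokens=False)) for ch in chars]
print("Encoded-length statistics:", Counter(lengths))
print("Average encoded length:", sum(lengths) / len(lengths))
```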
## Completeness
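A minimal round-trip sketch, assuming the Hugging Face `EleutherAI/gpt-neox-20b` tokenizer; since the BPE is byte-level, decoding the encoded ids should reproduce the input exactly:
```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

samples = ["hello world", "中文也没有 OOV", "def f():\n    return 1\n", ".*]{}"]
for text in samples:
    ids = tok.encode(text, add_special_tokens=False)
    decoded = tok.decode(ids, clean_up_tokenization_spaces=False)
    assert decoded == text, (text, decoded)
print("round-trip OK")
```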
## build tokenizer
## merge
"ard less",
|