|
|
|
## Vocabulary construction
|
|
|
- BERT vocabulary

- GPT vocabulary

- GPT-NeoX vocabulary
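These vocabularies are learned from a corpus; the GPT family uses BPE, which repeatedly merges the most frequent adjacent symbol pair. A minimal pure-Python sketch of the learning loop (function names are mine, not from any library), not a production implementation:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Rewrite every word, fusing each occurrence of the chosen pair."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        syms, out, i = word.split(), [], 0
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == (a, b):
                out.append(a + b)  # fuse the pair into one symbol
                i += 2
            else:
                out.append(syms[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a {word: count} corpus."""
    words = {" ".join(w): f for w, f in corpus.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words
```

On the classic toy corpus `{"low": 5, "lower": 2, "newest": 6, "widest": 3}`, the first two learned merges are `('e', 's')` and `('es', 't')`.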
|
|
|
## encode |
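Encoding a word against a BERT-style vocabulary is greedy longest-match-first: take the longest piece found in the vocab, mark every non-initial piece with `##`, and fall back to `[UNK]` when no piece covers the current position. A minimal sketch (function name and toy vocab are mine):

```python
def wordpiece_encode(word, vocab):
    """Greedy longest-match-first WordPiece encoding of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry ##
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the window until something matches
        if piece is None:
            return ["[UNK]"]  # nothing in the vocab covers this position
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##believ", "##able"}  # toy vocabulary
wordpiece_encode("unbelievable", vocab)  # -> ['un', '##believ', '##able']
```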
|
|
|
|
|
## decode |
|
|
|
The BERT vocabulary uses the special marker `##`: a token starting with `##` attaches to the previous token.
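With that convention, decoding is just joining on spaces and stripping the markers; a sketch (function name is mine):

```python
def bert_decode(tokens):
    """Undo WordPiece: '##' means 'glue this piece to the previous one'."""
    return " ".join(tokens).replace(" ##", "")

bert_decode(["un", "##believ", "##able"])  # -> 'unbelievable'
```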
|
|
|
What about the GPT-NeoX vocabulary?

- A token starting with `_` marks a space or the beginning of a sentence
|
|
|
|
|
## On tokenization granularity
|
|
|
|
|
## vocab_size
|
|
|
|
|
|
|
- bert-chinese vocab_size: 21128

- bert-en

- clue

- glm

- chatglm

- bloom
|
|
|
|
|
## bert |
|
|
|
``` |
|
[PAD] |
|
... |
|
[unused99] |
|
[UNK] |
|
[CLS] |
|
[SEP] |
|
[MASK] |
|
<S> |
|
<T> |
|
! |
|
... |
|
|
|
big |
|
##ut |
|
ftp |
|
carol |
|
##vi |
|
``` |
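In a vocab.txt like the excerpt above, the token id is simply the line number, so loading it is one dict comprehension. A sketch (function name is mine; the toy excerpt's ids are illustrative only, not the real file's ids):

```python
def load_vocab(lines):
    """BERT-style vocab.txt: one token per line, token id = line index."""
    return {token: i for i, token in enumerate(lines)}

# Toy excerpt mirroring the layout above (ids here are illustrative only)
vocab = load_vocab(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "big", "##ut"])
vocab["[PAD]"]  # -> 0
```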
|
|
|
|
|
## fairseq BPE (`@@`)
|
|
|
https://github.com/pytorch/fairseq/blob/master/tests/test_noising.py#L37 |
|
|
|
``` |
|
"he@@", "llo", "n@@", "ew", "y@@", "or@@", "k" |
|
``` |
|
|
|
Similar to BERT, except that BERT marks word-suffix pieces (`##ut`), while here the marked pieces are word prefixes (`he@@`).
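Decoding this scheme: join on spaces, then delete every `@@ ` so a marked piece fuses with what follows it. A sketch on the token sequence above (function name is mine):

```python
def subword_nmt_decode(tokens):
    """'@@' at the end of a piece means the word continues into the next piece."""
    return " ".join(tokens).replace("@@ ", "")

subword_nmt_decode(["he@@", "llo", "n@@", "ew", "y@@", "or@@", "k"])
# -> 'hello new york'
```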
|
|
|
|
|
## GPT2 |
|
|
|
Vocabulary: https://huggingface.co/gpt2/raw/main/vocab.json
|
|
|
|
|
``` |
|
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?'] |
|
``` |
|
Unlike BERT: BERT's special marker means "joined to the previous piece", while GPT2's special marker (`Ġ`) stands for a space.
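So GPT2 detokenization is plain concatenation followed by turning `Ġ` back into a space; a sketch on the token list above (function name is mine):

```python
def gpt2_detok(tokens):
    """'Ġ' encodes 'this token is preceded by a space'."""
    return "".join(tokens).replace("Ġ", " ")

gpt2_detok(['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?'])
# -> "What's up with the tokenizer?"
```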
|
|
|
See gpt2/README.md for details.
|
|
|
- Functional token: `<|endoftext|>` marks the end of a document (it is not a line break). What about tab and space?

- Many numbers get their own token ids, close to a thousand of them.
|
|
|
- moss is similar.
|
|
|
## Space, tab, newline
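GPT2's byte-level BPE answers this: every byte, whitespace included, is first mapped to a printable unicode character, so space becomes `Ġ`, tab `ĉ`, and newline `Ċ`. A re-implementation of the byte-to-unicode mapping from the GPT2 reference code:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character.

    Printable ASCII/Latin-1 bytes map to themselves; the rest (control
    characters, space, tab, newline, ...) are shifted past 255 so every
    byte gets a visible, unambiguous character.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shifted to a printable code point
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

m = bytes_to_unicode()
m[ord(" ")], m[ord("\t")], m[ord("\n")]  # -> ('Ġ', 'ĉ', 'Ċ')
```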
|
|
|
|
|
|
|