https://arxiv.org/abs/2308.16692 SpeechTokenizer

For OpenAI's models, the token efficiency of English is 8-12x that of Chinese.
Previously, once a Chinese prompt exceeded roughly 300 characters, GPT-3.5 Turbo 16k would start producing logically inverted output; after switching the prompt to English the problem never recurred.
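A quick way to sanity-check the efficiency gap is to count tokens for comparable sentences; a rough sketch with tiktoken (the exact ratio depends heavily on the text, and the 8-12x figure above is not reproduced here):

```python
# Rough comparison of tokens-per-character for English vs. Chinese.
# Assumes `pip install tiktoken`; gpt-3.5-turbo maps to the cl100k_base encoding.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "zh": "敏捷的棕色狐狸跳过了那只懒惰的狗。",
}
for label, text in samples.items():
    ids = enc.encode(text)
    print(label, len(text), "chars ->", len(ids), "tokens")
```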
## Vocabulary construction
- BERT vocabulary
- GPT vocabulary
- GPT-NeoX vocabulary
## encode | |
## decode | |
The BERT vocabulary has a special marker, `##`, prefixed to subword pieces that continue a word.
What about the GPT-NeoX vocabulary?
- A leading `_` marks a space or the start of the sentence.
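One way to check which convention a vocabulary actually uses is to tokenize a short sentence and inspect the piece prefixes; a sketch assuming the `EleutherAI/gpt-neox-20b` and `bert-base-chinese` checkpoints on the Hub:

```python
# Print a few tokens to see each vocabulary's space-marking convention.
# Assumes `pip install transformers` and access to the Hugging Face Hub.
from transformers import AutoTokenizer

for name in ["EleutherAI/gpt-neox-20b", "bert-base-chinese"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize("Hello world, tokenizers!"))
```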
## On tokenization granularity
## Vocabulary sizes
- bert-chinese vocab_size: 21128
- bert-en
- clue
- glm
- chatglm
- bloom
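The missing sizes can be read directly off the tokenizer objects; a sketch with assumed Hub checkpoint names (they may not match the exact variants meant above, and GLM/ChatGLM additionally need `trust_remote_code=True`):

```python
# Print vocab_size for a few of the vocabularies listed above.
# Assumes `pip install transformers`; checkpoint names are only guesses.
from transformers import AutoTokenizer

for name in ["bert-base-chinese", "bert-base-uncased", "bigscience/bloom"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.vocab_size)
```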
## Minimal vocabulary
mobilenet | |
## bert | |
``` | |
[PAD] | |
... | |
[unused99] | |
[UNK] | |
[CLS] | |
[SEP] | |
[MASK] | |
<S> | |
<T> | |
! | |
... | |
big | |
##ut | |
ftp | |
carol | |
##vi | |
``` | |
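The `##` pieces in the excerpt appear whenever a word gets split; a minimal check, assuming `bert-base-uncased`:

```python
# BERT WordPiece: non-initial pieces of a word carry the "##" prefix.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("tokenization"))  # e.g. ['token', '##ization']
print(tok.tokenize("unbelievably"))  # splits into several '##' pieces
```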
## @@ | |
https://github.com/pytorch/fairseq/blob/master/tests/test_noising.py#L37 | |
``` | |
"he@@", "llo", "n@@", "ew", "y@@", "or@@", "k" | |
``` | |
Similar to BERT, except BERT's marker flags word-suffix pieces (`##ut`), while here the marker flags word-prefix pieces (`he@@`).
This looks like the scheme from https://github.com/rsennrich/subword-nmt.
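With this convention, undoing the segmentation is just a string replace over the `@@ ` boundaries; a minimal sketch (learning the merges themselves is what subword-nmt does):

```python
# Rejoin subword-nmt style "@@" pieces into whole words.
tokens = ["he@@", "llo", "n@@", "ew", "y@@", "or@@", "k"]
text = " ".join(tokens).replace("@@ ", "")
print(text)  # hello new york
```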
## GPT2 | |
Vocabulary: https://huggingface.co/gpt2/raw/main/vocab.json
``` | |
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?'] | |
``` | |
Unlike BERT, which uses a special symbol to mark "continuation", GPT-2 uses a special symbol to mark "space".
See gpt2/README.md for details.
- Functional token: `<|endoftext|>` (end of text). What about newline? Tab? Space?
- Many numbers are encoded as standalone tokens, close to a thousand of them.
- MOSS is similar in this respect.
### What is Ġ?
It's a feature of byte-level BPE (an encoded space character).
Ġ stands for a space; some variants use Ä instead of Ġ.
```sh | |
What's up with the tokenizer? | |
# after BPE
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
# after encoding with vocab.json
[ 2061, 338, 510, 351, 262, 11241, 7509, 30]
# after encoding with dict.txt (fairseq-specific)
[ different ids ]
``` | |
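`Ġ` is not special punctuation in the original text; it is what the space byte `0x20` becomes under GPT-2's byte-to-unicode remapping, where bytes that are not printable on their own are shifted into a higher range. A small sketch of that mapping for the low bytes:

```python
# GPT-2's byte-level BPE remaps "non-printable" bytes to chr(256 + i);
# for the low bytes 0x00-0x20 this is simply a shift by 256.
print(chr(0x20 + 0x100))  # 'Ġ' <- space
print(chr(0x09 + 0x100))  # 'ĉ' <- tab
print(chr(0x0A + 0x100))  # 'Ċ' <- newline
```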
Question: why does "up" get a Ġ but "What" does not? Presumably because of the prefix-space handling (only tokens preceded by a space get the marker); see the sketch after the links below.
- https://github.com/pytorch/fairseq/issues/1716 | |
- https://github.com/huggingface/transformers/issues/1083 | |
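A sketch of the `add_prefix_space` behaviour discussed in those issues, assuming the `gpt2` checkpoint: only tokens preceded by a space get the `Ġ` marker, so the first word is bare unless a prefix space is added.

```python
# With add_prefix_space=True the leading word also gets the Ġ marker.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok_ps = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
print(tok.tokenize("What's up with the tokenizer?"))     # ['What', ...]
print(tok_ps.tokenize("What's up with the tokenizer?"))  # ['ĠWhat', ...]
```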
## Space, tab, newline
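A quick way to see how these are handled by GPT-2's byte-level BPE is to tokenize them directly; a sketch assuming the `gpt2` checkpoint (space/tab/newline surface as the remapped characters Ġ/ĉ/Ċ):

```python
# Whitespace under GPT-2's byte-level BPE: each byte survives as a remapped char.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(tok.tokenize("a b\tc\nd"))
```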
## reversible and lossless | |
It's reversible and lossless, so you can convert tokens back into the original text.
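A round-trip check, as a sketch with tiktoken's `gpt2` encoding:

```python
# Byte-level BPE is lossless: decode(encode(text)) reproduces the input exactly.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "What's up with the tokenizer?\n\tTabs and 中文 survive too."
assert enc.decode(enc.encode(text)) == text
print("round-trip OK")
```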
## diff | |