https://arxiv.org/abs/2308.16692 SpeechTokenizer

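For OpenAI's models, English is 8-12 times more token-efficient than Chinese.
Previously, with Chinese prompts longer than about 300 characters, Turbo 3.5 16k would start to invert its logic; after switching the prompts to English, the problem has not appeared again.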

## Vocabulary construction

bert vocabulary
gpt vocabulary
gpt-neox vocabulary

## encode


## decode

The BERT vocabulary has a special marker, `##`, which prefixes pieces that continue the previous piece of a word.
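As a rough illustration of what decoding has to do with this marker, here is a minimal sketch that glues `##` pieces back onto the preceding piece. The function name `join_wordpieces` and the sample tokens are hypothetical, chosen for illustration; this is not the BERT reference implementation, which also handles punctuation and special tokens.

```python
def join_wordpieces(tokens):
    """Merge WordPiece-style tokens: a piece starting with '##' continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: drop the marker and append to the last word.
            words[-1] += tok[2:]
        else:
            # Word-initial piece: start a new word.
            words.append(tok)
    return " ".join(words)


# Hypothetical example tokens, not taken from a real tokenizer run.
print(join_wordpieces(["token", "##izer", "arena"]))
```

Joining with a single space is a simplification: the original spacing is not stored in the pieces, so this kind of decode is not guaranteed to reproduce the input exactly.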

What about the gpt-neox vocabulary?
  - A leading `_` denotes a space or the start of a sentence


## On tokenization granularity


## ss



bert-chinese  vocab_size: 21128
bert-en
clue
glm
chatglm
bloom


## Smallest vocabulary

mobilenet


## ss


## bert

```
[PAD]
...
[unused99]
[UNK]
[CLS]
[SEP]
[MASK]
<S>
<T>
!
...

big
##ut
ftp
carol
##vi
```


## @@

https://github.com/pytorch/fairseq/blob/master/tests/test_noising.py#L37

```
"he@@", "llo", "n@@", "ew", "y@@", "or@@", "k"
```

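Similar to BERT, except that BERT marks the later part of a word (`##` on continuation pieces), while here `@@` marks the earlier part (pieces that are followed by a continuation).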

This format is probably the one from https://github.com/rsennrich/subword-nmt
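To make the `@@` convention concrete, the sketch below undoes it for the token list shown above: join the pieces with spaces, then delete every `@@` together with the space that follows it. The helper name `merge_bpe` is hypothetical; the approach mirrors the usual "remove `@@ `" post-processing step rather than any particular library function.

```python
def merge_bpe(tokens, marker="@@"):
    """Rejoin subword pieces where a trailing marker means 'the word continues'."""
    text = " ".join(tokens)
    # Removing "@@ " glues each marked piece onto the piece after it.
    return text.replace(marker + " ", "")


tokens = ["he@@", "llo", "n@@", "ew", "y@@", "or@@", "k"]
print(merge_bpe(tokens))
```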


## GPT2

For the vocabulary, see https://huggingface.co/gpt2/raw/main/vocab.json


```
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
```
Unlike BERT, which uses a special symbol to mark a "join", GPT2 uses a special symbol to mark a "space".

For details, see gpt2/README.md

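- Special tokens: `<|endoftext|>` marks the end of a text (a document boundary), not a newline. Tab? Space?
- Many numbers are encoded as their own tokens, almost a thousand of them.

- Similar: moss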


### What is Ġ

It's a feature of byte-level BPE (an encoded space character).
Ġ represents a space; some versions show Ä in place of Ġ.
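Where Ġ comes from can be sketched in a few lines. In GPT-2's byte-to-unicode table, byte values 0 through 32 (control characters and the space) are not printable, so each is shifted up by 256 to a visible character. The helper name `visible_char` is hypothetical, and the shortcut below only holds for that 0..32 range, not for the whole table.

```python
def visible_char(ch):
    """Visible stand-in for a control character or space (byte values 0..32 only)."""
    code = ord(ch)
    if code > 32:
        raise ValueError("this shortcut only covers byte values 0..32")
    return chr(256 + code)


for name, ch in [("space", " "), ("tab", "\t"), ("newline", "\n")]:
    print(name, ord(ch), visible_char(ch))
```

Under this mapping the space (byte 32) becomes U+0120, which is Ġ; tab and newline get their own stand-ins in the same way.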


```sh
What's up with the tokenizer?
# after BPE
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
# after encoding with vocab.json
[ 2061,   338,  510,    351,    262,    11241,    7509,   30]
# after encoding with dict.txt (fairseq-specific)
[           other numbers                                    ]
```
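The `vocab.json` step is a plain dictionary lookup from token string to id. The sketch below uses only the eight entries quoted in the example (a hand-copied excerpt named `vocab_excerpt`, not the full file) and shows the lookup in both directions.

```python
# Excerpt of token -> id pairs copied from the example; the real vocab.json is much larger.
vocab_excerpt = {
    "What": 2061, "'s": 338, "Ġup": 510, "Ġwith": 351,
    "Ġthe": 262, "Ġtoken": 11241, "izer": 7509, "?": 30,
}
# Inverse table for turning ids back into token strings.
id_to_token = {i: t for t, i in vocab_excerpt.items()}

tokens = ["What", "'s", "Ġup", "Ġwith", "Ġthe", "Ġtoken", "izer", "?"]
ids = [vocab_excerpt[t] for t in tokens]
print(ids)
print([id_to_token[i] for i in ids])
```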
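Question: `up` gets a Ġ, so why doesn't `What`? Because there is a pre... (Ġ encodes the space in front of a word, and `What` starts the input with no space before it.)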

- https://github.com/pytorch/fairseq/issues/1716
- https://github.com/huggingface/transformers/issues/1083


## Spaces, tabs, newlines





## reversible and lossless

It's reversible and lossless, so you can convert tokens back into the original text.
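The reversibility can be checked at the byte-mapping level with a small sketch. The table construction follows the scheme used by GPT-2's byte-level BPE (printable bytes map to themselves, the rest are shifted to code points from 256 upward); the helper names `to_visible` and `from_tokens` are hypothetical, and the BPE merge step itself is left out.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable byte: assign the next free code point from 256 up.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_visible(text):
    """Text -> UTF-8 bytes -> string of visible stand-in characters."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_tokens(tokens):
    """Concatenate token strings and map them back to the original text."""
    data = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return data.decode("utf-8")


tokens = ["What", "'s", "Ġup", "Ġwith", "Ġthe", "Ġtoken", "izer", "?"]
print(from_tokens(tokens))
```

Because the table is a bijection over all 256 byte values, any UTF-8 text, including Chinese, tabs, and newlines, survives the round trip unchanged.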


## diff