## Spaces
<img title="GPT-2 tokenization vs. GPT-NeoX-20B tokenization. GPT-NeoX-20B handles whitespace better, which is especially useful for text such as source code." src="/images/gptNeoX20B-VS-gpt2.jpg">
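The difference is easy to check directly. A minimal sketch, assuming the `transformers` library and the hub IDs `gpt2` and `EleutherAI/gpt-neox-20b`:
```
from transformers import AutoTokenizer

gpt2 = AutoTokenizer.from_pretrained("gpt2")
neox = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

code = "def f():\n        return 1"  # 8-space indent
# GPT-2 typically spends one token per space, while GPT-NeoX-20B has tokens
# for runs of whitespace, so indented code encodes to fewer tokens.
print(len(gpt2.tokenize(code)), len(neox.tokenize(code)))
```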
## 20B
[configs/20B.yml](https://github.com/EleutherAI/gpt-neox/blob/main/configs/20B.yml#L7)
```
"vocab-file": "./20B_checkpoints/20B_tokenizer.json",
```
Vocab size: 50277
self.padded_vocab_size = 50304
padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)
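The padding rounds the vocab up to a multiple of 128 (50304 = 128 × 393) so the embedding matrix partitions evenly under tensor parallelism. A sketch of the arithmetic; the divisor of 128 follows gpt-neox's default `make-vocab-size-divisible-by`, treated here as an assumption:
```
def pad_vocab_size(vocab_size, divisible_by=128, model_parallel_size=1):
    # pad until the size is divisible by divisible_by * model_parallel_size
    multiple = divisible_by * model_parallel_size
    while vocab_size % multiple != 0:
        vocab_size += 1
    return vocab_size

print(pad_vocab_size(50277))  # 50304 = 50277 + 27 dummy tokens
```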
## Vocabulary
See convert_vocab_to_txt.py
```
{"id": 13609, "token": "\u00e4\u00b8\u0143", "token_decode": "\u4e2d"}  # 中
# several symbols concatenated into a single token:
{"id": 13663, "token": ".*]{}", "token_decode": ".*]{}"}
```
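The `token` field stores the UTF-8 bytes through GPT-2's byte-to-unicode mapping, which is why 中 (bytes E4 B8 AD) appears as `\u00e4\u00b8\u0143` (ä¸Ń). A sketch of the round trip; the mapping below follows GPT-2's standard `bytes_to_unicode` and is reproduced here as an assumption, not copied from the repo:
```
def bytes_to_unicode():
    # printable bytes map to themselves; the rest are shifted past 256
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_decoder = {v: k for k, v in bytes_to_unicode().items()}
token = "\u00e4\u00b8\u0143"  # as stored in the vocab file
raw = bytes(byte_decoder[c] for c in token)
print(raw.decode("utf-8"))  # 中
```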
## special_tokens
https://huggingface.co/EleutherAI/gpt-neox-20b/blob/main/special_tokens_map.json
```
{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>"}
```
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
```
unk_token="<|endoftext|>",
bos_token="<|endoftext|>",
eos_token="<|endoftext|>",
```
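A quick check through `transformers` (a sketch, assuming network access to the hub):
```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# All three special tokens map to the same string; note that
# special_tokens_map.json defines no dedicated pad token.
print(tok.bos_token, tok.eos_token, tok.unk_token)  # <|endoftext|> x3
```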
## Chinese support
Essentially no OOV.
GPT-NeoX was trained on an 800GB English dataset, so why does the vocabulary cover Chinese? Because the tokenizer is byte-level BPE: any UTF-8 string can be encoded at the byte level, so unseen characters fall back to multi-byte token sequences instead of becoming OOV.
```
丁 [3218, 212]
七 [3218, 214]
万 [3218, 218]
诀 [11894, 211]
证 [11894, 212]
```
Token-length distribution: Counter({2: 4190, 3: 1295, 1: 285})
Average encoded length: 2.1750433275563257
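A sketch of how these statistics can be reproduced. The character list here is illustrative; the numbers above presumably come from a full common-character list (the Counter implies ~5,770 characters):
```
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
chars = ["丁", "七", "万", "诀", "证"]  # replace with the full character list
lengths = [len(tok.encode(c)) for c in chars]
print(Counter(lengths))               # distribution of tokens per character
print(sum(lengths) / len(lengths))    # average encoded length
```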
## Completeness
## build tokenizer
## merge
"ard less", | |