## Are both tokenizer.json and tokenizer.model needed?
## Completeness
The following 256 byte tokens guarantee the completeness of the vocabulary:
```
"vocab": {
  "<0x00>": 3,
  "<0x01>": 4,
  ...
  "<0xFE>": 257,
  "<0xFF>": 258,
```
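A minimal plain-Python sketch of why these 256 entries make the vocabulary complete, assuming the id layout shown above (`<0x00>` = 3, so id = byte value + 3): any character missing from the vocabulary can be decomposed into its UTF-8 bytes, and every byte is guaranteed a token.

```python
# Byte-fallback sketch: every UTF-8 byte 0x00-0xFF has a token,
# so no input string can ever be out-of-vocabulary.
def byte_fallback_ids(text: str, offset: int = 3) -> list[int]:
    """Map each UTF-8 byte of `text` to its <0xNN> token id.

    With the vocab above, <0x00> = 3 ... <0xFF> = 258,
    i.e. id = byte value + offset.
    """
    return [b + offset for b in text.encode("utf-8")]

# "中" is 3 bytes in UTF-8 (0xE4 0xB8 0xAD), so it costs 3 tokens.
ids = byte_fallback_ids("中")
```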
## Normalizer, post-processor, and decoder
```json
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Prepend",
      "prepend": "▁"
    },
    {
      "type": "Replace",
      "pattern": {
        "String": " "
      },
      "content": "▁"
    }
  ]
},
"post_processor": {
  "type": "TemplateProcessing",
  "single": [
    {
      "SpecialToken": {
        "id": "<s>",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "A",
        "type_id": 0
      }
    }
  ],
  "pair": [
    {
      "SpecialToken": {
        "id": "<s>",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "A",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "B",
        "type_id": 0
      }
    }
  ],
  "special_tokens": {
    "<s>": {
      "id": "<s>",
      "ids": [
        1
      ],
      "tokens": [
        "<s>"
      ]
    }
  }
},
"decoder": {
  "type": "Sequence",
  "decoders": [
    {
      "type": "Replace",
      "pattern": {
        "String": "▁"
      },
      "content": " "
    },
    {
      "type": "ByteFallback"
    },
    {
      "type": "Fuse"
    },
    {
      "type": "Strip",
      "content": " ",
      "start": 1,
      "stop": 0
    }
  ]
},
```
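A plain-Python sketch of what the normalizer and decoder above do (illustrative only, not the real Rust implementation in `tokenizers`): the normalizer prepends `▁` and rewrites spaces to `▁`; the decoder maps `▁` back to spaces and then strips the single leading space the normalizer introduced (the `Strip` step with `start: 1, stop: 0`).

```python
# SentencePiece-style space handling, as configured above (sketch).

def normalize(text: str) -> str:
    """Prepend '▁', then replace every space with '▁'."""
    return ("▁" + text).replace(" ", "▁")

def decode(text: str) -> str:
    """Replace '▁' back to ' ', then strip one leading space
    (mirrors the Replace + Strip steps in the decoder config)."""
    restored = text.replace("▁", " ")
    return restored[1:] if restored.startswith(" ") else restored

# Round trip: decode(normalize(s)) recovers s.
s = "Hello world"
assert decode(normalize(s)) == s
```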
## Issues
1. https://github.com/LianjiaTech/BELLE/issues/45
The ~700 Chinese characters in LLaMA are only the explicitly supported ones; the Unicode Chinese characters it implicitly supports far exceed 700, which you can verify by experimenting with any BERT vocabulary. The annoying part is that each such character then gets encoded into 4-5 byte-fallback tokens, so sequence length balloons; HIT's (哈工大) Chinese vocabulary extension is the more reliable approach.
2. https://github.com/LianjiaTech/BELLE/issues/43
Does using LLaMA on Chinese require extra work on the vocabulary?
It should. I checked the intersection of the LLaMA vocabulary with the 3500 most common Chinese characters: only 600-odd overlap. To extend the vocabulary, see https://github.com/ymcui/Chinese-LLaMA-Alpaca
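To make the length inflation from issue 1 concrete, a quick plain-Python check (pure UTF-8 arithmetic, no tokenizer needed): a CJK character absent from the vocabulary falls back to its 3 UTF-8 bytes, so byte-fallback text costs roughly 3 tokens per character instead of 1.

```python
# Each CJK character is 3 bytes in UTF-8, so under byte fallback it
# costs 3 tokens where a character-level vocab would spend 1.
text = "今天天气不错"                       # 6 Chinese characters
n_chars = len(text)                         # 6
n_byte_tokens = len(text.encode("utf-8"))   # 18 (3 bytes per char)
ratio = n_byte_tokens / n_chars             # 3x inflation
```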