## Are both tokenizer.json and tokenizer.model needed?
## Completeness
The following 256 byte tokens guarantee the completeness of the vocabulary:
```
"vocab": {
  "<0x00>": 3,
  "<0x01>": 4,
  ...
  "<0xFE>": 257,
  "<0xFF>": 258,
```
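A minimal plain-Python sketch of why these 256 entries make the vocabulary complete, assuming the id layout shown above (`<0x00>` = 3, so id = byte value + 3): any character missing from the vocabulary can be decomposed into its UTF-8 bytes, and every byte is guaranteed a token.

```python
# Byte-fallback sketch: every UTF-8 byte 0x00-0xFF has a token,
# so no input string can ever be out-of-vocabulary.
def byte_fallback_ids(text: str, offset: int = 3) -> list[int]:
    """Map each UTF-8 byte of `text` to its <0xNN> token id.

    With the vocab above, <0x00> = 3 ... <0xFF> = 258,
    i.e. id = byte value + offset.
    """
    return [b + offset for b in text.encode("utf-8")]

# "中" is 3 bytes in UTF-8 (0xE4 0xB8 0xAD), so it costs 3 tokens.
ids = byte_fallback_ids("中")
```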
## Normalizer, post-processor, and decoder
```json
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Prepend",
      "prepend": "▁"
    },
    {
      "type": "Replace",
      "pattern": {
        "String": " "
      },
      "content": "▁"
    }
  ]
},
"post_processor": {
  "type": "TemplateProcessing",
  "single": [
    {
      "SpecialToken": {
        "id": "<s>",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "A",
        "type_id": 0
      }
    }
  ],
  "pair": [
    {
      "SpecialToken": {
        "id": "<s>",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "A",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "B",
        "type_id": 0
      }
    }
  ],
  "special_tokens": {
    "<s>": {
      "id": "<s>",
      "ids": [
        1
      ],
      "tokens": [
        "<s>"
      ]
    }
  }
},
"decoder": {
  "type": "Sequence",
  "decoders": [
    {
      "type": "Replace",
      "pattern": {
        "String": "▁"
      },
      "content": " "
    },
    {
      "type": "ByteFallback"
    },
    {
      "type": "Fuse"
    },
    {
      "type": "Strip",
      "content": " ",
      "start": 1,
      "stop": 0
    }
  ]
},
```
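A plain-Python sketch of what the normalizer and decoder above do (illustrative only, not the real Rust implementation in `tokenizers`): the normalizer prepends `▁` and rewrites spaces to `▁`; the decoder maps `▁` back to spaces and then strips the single leading space the normalizer introduced (the `Strip` step with `start: 1, stop: 0`).

```python
# SentencePiece-style space handling, as configured above (sketch).

def normalize(text: str) -> str:
    """Prepend '▁', then replace every space with '▁'."""
    return ("▁" + text).replace(" ", "▁")

def decode(text: str) -> str:
    """Replace '▁' back to ' ', then strip one leading space
    (mirrors the Replace + Strip steps in the decoder config)."""
    restored = text.replace("▁", " ")
    return restored[1:] if restored.startswith(" ") else restored

# Round trip: decode(normalize(s)) recovers s.
s = "Hello world"
assert decode(normalize(s)) == s
```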
## Issues
1. https://github.com/LianjiaTech/BELLE/issues/45
The ~700 Chinese characters in LLaMA are only the explicitly supported ones; the Unicode Chinese characters it implicitly supports far exceed 700, which you can verify by experimenting with any BERT vocabulary. The annoying part is that each such character then gets encoded into 4-5 byte-fallback tokens, so sequence length balloons; HIT's (哈工大) Chinese vocabulary extension is the more reliable approach.
2. https://github.com/LianjiaTech/BELLE/issues/43
Does using LLaMA on Chinese require extra work on the vocabulary?
It should. I checked the intersection of the LLaMA vocabulary with the 3500 most common Chinese characters: only 600-odd overlap. To extend the vocabulary, see https://github.com/ymcui/Chinese-LLaMA-Alpaca
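To make the length inflation from issue 1 concrete, a quick plain-Python check (pure UTF-8 arithmetic, no tokenizer needed): a CJK character absent from the vocabulary falls back to its 3 UTF-8 bytes, so byte-fallback text costs roughly 3 tokens per character instead of 1.

```python
# Each CJK character is 3 bytes in UTF-8, so under byte fallback it
# costs 3 tokens where a character-level vocab would spend 1.
text = "今天天气不错"                       # 6 Chinese characters
n_chars = len(text)                         # 6
n_byte_tokens = len(text.encode("utf-8"))   # 18 (3 bytes per char)
ratio = n_byte_tokens / n_chars             # 3x inflation
```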