Why is tokenizer.json 8 vocab entries off?

#22 by shx26

In tokenizer.json, token IDs 3-5 and 9-15 have no definitions.
This results in a discrepancy between the model's vocab size (config.json) and the tokenizer's vocab size.

```python
from transformers import AutoConfig, AutoTokenizer

model = "01-ai/Yi-34B"
config = AutoConfig.from_pretrained(model)
tokenizer = AutoTokenizer.from_pretrained(model)
print(config.vocab_size)  # 64000
print(len(tokenizer))     # 63992
```
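
To see exactly which IDs fall into the gap, you can diff the tokenizer's vocabulary against the full ID range from the config; a minimal sketch, reusing the `config` and `tokenizer` objects above:

```python
# IDs in [0, vocab_size) that the tokenizer never defines.
defined_ids = set(tokenizer.get_vocab().values())
missing_ids = sorted(i for i in range(config.vocab_size) if i not in defined_ids)
print(missing_ids)  # the undefined IDs behind the discrepancy
```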

This has caused decoding errors when Yi-34B is run with other frameworks such as vLLM (e.g. https://github.com/vllm-project/vllm/issues/340). If these skipped-over IDs have no definition, why not realign all IDs starting from 0 and eliminate the gap, or fill in some empty definitions to avoid the error?
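
Until the tokenizer itself is fixed, one downstream workaround is to drop undefined IDs before decoding. A rough sketch, assuming the `tokenizer` object from above; the `safe_decode` helper is hypothetical, not part of any framework:

```python
def safe_decode(tokenizer, ids):
    # Filter out IDs with no vocab entry so decode() never sees them.
    defined = set(tokenizer.get_vocab().values())
    return tokenizer.decode([i for i in ids if i in defined])

print(safe_decode(tokenizer, [1, 3, 42]))  # undefined ID 3 is silently skipped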

01-ai org

Thanks to @shx26 for the question! Those tokens are likely reserved for special tokens like <|im_start|>. Regarding the compatibility issue with the vLLM framework, we haven't encountered it in our previous usage of vLLM. I am currently trying to reproduce the issue and will get back to you with an update shortly!
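
One way to check the special-token theory is to ask the tokenizer what it knows about the gap IDs; a quick sketch, reusing `missing_ids` from above (on fast tokenizers, truly undefined IDs come back as None):

```python
# If the gap IDs were real special tokens, they would resolve to names here.
print(tokenizer.convert_ids_to_tokens(missing_ids))  # None entries = undefined
print(tokenizer.all_special_tokens)                  # the declared special tokens
```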
