[Query-ISSUE] tokenizer.vocab_size is 128000, but len(tokenizer) is 128256, which prevents me from using those extra tokens.

#34
by HV-Khurdula


@HV-Khurdula The extra 256 are special tokens, with token ids ranging from 128000 to 128255.

These include <|begin_of_text|>, <|end_of_text|>, <|reserved_special_token_0|>, and so on. The first two are already in use as the BOS and EOS tokens.
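If it helps, here is a minimal sketch (assuming the transformers AutoTokenizer and access to the gated meta-llama/Llama-3.2-1B checkpoint) showing where the two numbers come from and how to list the added special tokens:

```python
from transformers import AutoTokenizer

# Assumes access to the gated meta-llama/Llama-3.2-1B repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

print(tokenizer.vocab_size)  # 128000 -- base BPE vocabulary only
print(len(tokenizer))        # 128256 -- base vocabulary + 256 added special tokens

# The added special tokens (ids 128000-128255) and their strings:
for token_id, token in sorted(tokenizer.added_tokens_decoder.items()):
    print(token_id, token.content)

print(tokenizer.bos_token, tokenizer.bos_token_id)  # <|begin_of_text|> 128000
print(tokenizer.eos_token, tokenizer.eos_token_id)  # <|end_of_text|> 128001
```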

You can find the complete list in the tokenizer_config.json file.

https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/tokenizer_config.json
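And since the original question was about using the extra tokens: nothing stops you from doing so. A rough sketch of repurposing one of the reserved slots (for example as a padding token) might look like this; the exact id is whatever convert_tokens_to_ids returns for your checkpoint:

```python
# Look up one of the reserved slots; it already has an id in the 128000-128255 range,
# so no embedding resize is needed.
reserved = "<|reserved_special_token_0|>"
reserved_id = tokenizer.convert_tokens_to_ids(reserved)
print(reserved, reserved_id)

# Example: repurpose it as the pad token. The string already exists in the vocab,
# so this simply maps pad_token_id to the existing id.
tokenizer.pad_token = reserved
print(tokenizer.pad_token_id)
```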
