[Query-ISSUE] tokenizer.vocab_size is 128000, however len(tokenizer) is 128256, which prevents me from using those other tokens.
#34 · opened by HV-Khurdula
@HV-Khurdula The extra 256 are special tokens with token ids ranging from 128000 to 128255. These are <|begin_of_text|>, <|end_of_text|>, <|reserved_special_token_0|>, etc. The first two are already in use as the BOS and EOS tokens. You can find the complete list in the tokenizer_config.json file:
https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/tokenizer_config.json
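As a quick illustration, here is a minimal sketch (assuming you have access to the gated meta-llama/Llama-3.2-1B repo) showing that these ids above vocab_size are still usable through the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

print(tokenizer.vocab_size)  # 128000 -- size of the base BPE vocabulary
print(len(tokenizer))        # 128256 -- base vocabulary + 256 added special tokens

# The added special tokens occupy ids 128000-128255 and can be looked up normally.
print(tokenizer.convert_ids_to_tokens(128000))  # '<|begin_of_text|>'
print(tokenizer.convert_ids_to_tokens(128001))  # '<|end_of_text|>'

# If you resize a model's embeddings, use len(tokenizer) rather than
# tokenizer.vocab_size so the special token ids are covered, e.g.:
# model.resize_token_embeddings(len(tokenizer))
```

In other words, nothing prevents you from using those ids; they just are not counted in vocab_size because they were added on top of the base vocabulary.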