Why are vocab_size and len(tokenizer) different lengths?

#17
opened by choco9966

When I ran tokenizer.vocab_size and len(tokenizer), I found that the values were different. I was wondering why they differ, and whether this would cause any problem for inference or continual training.

>>> tokenizer.vocab_size
262144
>>> len(tokenizer)
262145
Google org

len(tokenizer) counts all vocabulary indices, including index 0, while tokenizer.vocab_size gives the number of vocabulary entries without accounting for the 0-based indexing. That is why len(tokenizer) is one greater than tokenizer.vocab_size.
Please refer to this gist for further clarification.
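As a minimal sketch of what the two numbers measure in Transformers (the checkpoint name below is a placeholder, substitute the model you are actually using): tokenizer.vocab_size reports the base vocabulary only, while len(tokenizer) also counts the tokens registered on top of it, which get_added_vocab() lists.

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # placeholder checkpoint
>>> tok.vocab_size         # base vocabulary only (262144 in the example above)
>>> len(tok)               # base vocabulary plus added tokens (262145 in the example above)
>>> tok.get_added_vocab()  # dict of the added tokens and their ids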

But the vocab ranges from 0 to 262144. Then shouldn't the vocab size be 262145?
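As a quick check (again only a sketch, reusing the tok loaded above), you can look up index 262144 directly to see whether it belongs to an added token rather than the base vocabulary:

>>> tok.convert_ids_to_tokens(262144)         # the token sitting at the index beyond vocab_size
>>> 262144 in tok.get_added_vocab().values()  # True if that index comes from an added token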
