Why are vocab_size and len(tokenizer) different lengths?

#17
opened by choco9966

When I ran tokenizer.vocab_size and len(tokenizer), I found that the values were different. I was wondering why they differ, and whether this would cause any problem for inference or continual training.

>>> tokenizer.vocab_size
262144
>>> len(tokenizer)
262145
Google org

len(tokenizer) counts all vocabulary indices, including index 0, while tokenizer.vocab_size gives the number of vocabulary entries without accounting for the 0-based indexing. That is why len(tokenizer) is one greater than tokenizer.vocab_size.
Please refer to this gist for further clarification.
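As a minimal sketch of what the two numbers measure in Transformers (the checkpoint name below is a placeholder, substitute the model you are actually using): tokenizer.vocab_size reports the base vocabulary only, while len(tokenizer) also counts the tokens registered on top of it, which get_added_vocab() lists.

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # placeholder checkpoint
>>> tok.vocab_size         # base vocabulary only (262144 in the example above)
>>> len(tok)               # base vocabulary plus added tokens (262145 in the example above)
>>> tok.get_added_vocab()  # dict of the added tokens and their ids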

But the vocab ranges from 0 to 262144. Then shouldn't the vocab size be 262145?
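As a quick check (again only a sketch, reusing the tok loaded above), you can look up index 262144 directly to see whether it belongs to an added token rather than the base vocabulary:

>>> tok.convert_ids_to_tokens(262144)         # the token sitting at the index beyond vocab_size
>>> 262144 in tok.get_added_vocab().values()  # True if that index comes from an added token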
