Why are tokenizer.vocab_size and len(tokenizer) different?
#17 · opened by choco9966
When I compared tokenizer.vocab_size and len(tokenizer), I found that the two values were different. Why do they differ, and is the mismatch a problem for inference or continual training?
>>> tokenizer.vocab_size
262144
>>> len(tokenizer)
262145
len(tokenizer) counts all vocabulary indices, including 0, while tokenizer.vocab_size represents the number of vocabulary entries without considering the 0-based indexing. This leads to len(tokenizer) being one greater than tokenizer.vocab_size.
Please refer to this gist for further clarification.
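To see concretely where the extra entry comes from, you can compare the two counts yourself and list any tokens registered on top of the base vocabulary. This is just a minimal sketch assuming a standard transformers tokenizer; the checkpoint id below is a placeholder for whatever model this discussion refers to:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("your-model-id")  # placeholder checkpoint id
>>> tokenizer.vocab_size         # size of the base vocabulary only
262144
>>> len(tokenizer)               # base vocabulary plus any tokens added on top of it
262145
>>> tokenizer.get_added_vocab()  # dict of added tokens and their ids

If get_added_vocab() returns a non-empty mapping, those added tokens are what make len(tokenizer) larger than tokenizer.vocab_size.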
But the vocab IDs range from 0 to 262144, so shouldn't the vocab size be 262145?