Mismatch between tokenizer and model vocab size?

#72
by yairschiff - opened

When I run the following snippet:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

print(f"{tokenizer.vocab_size=} vs. {model.config.vocab_size=}")

There appears to be a mismatch between the two reported vocabulary sizes:

>>> tokenizer.vocab_size=50280 vs. model.config.vocab_size=50368

What is the correct usage of the tokenizer / model combination here?
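
For reference, here is a minimal sketch of how the two numbers can be inspected side by side (standard transformers attributes only; the padding note in the comments is an assumption on my part, not something stated on the model card):

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# tokenizer.vocab_size counts only the base vocabulary;
# len(tokenizer) also includes any added special tokens.
print(f"{tokenizer.vocab_size=}, {len(tokenizer)=}")

# config.vocab_size reflects the size of the input embedding matrix, which may be
# padded past the tokenizer's vocabulary (e.g. to a multiple of 64) for efficiency;
# any extra rows would simply go unused by the tokenizer.
embedding_rows = model.get_input_embeddings().weight.shape[0]
print(f"{embedding_rows=}, {model.config.vocab_size=}")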
