Mismatch between tokenizer and model vocab size?
#72 · opened by yairschiff
When I run the following snippet:
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
print(f"{tokenizer.vocab_size=} vs. {model.config.vocab_size=}")
There appears to be a mismatch in vocabulary size:
>>> tokenizer.vocab_size=50280 vs. model.config.vocab_size=50368
What is the correct usage of the tokenizer / model combination here?
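For context, here is a minimal sanity check I put together (my own sketch, assuming the extra rows in the model's embedding/output layers are just unused padding); it only confirms that every id the tokenizer emits still indexes into the model's larger vocabulary:

import torch

# Sanity check (my own sketch): ids produced by the tokenizer should all
# fall below the model's configured vocab size.
enc = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
assert enc["input_ids"].max().item() < model.config.vocab_size

with torch.no_grad():
    out = model(**enc)

# Logits are produced over the model's (padded) vocab.
print(out.logits.shape)  # (1, seq_len, model.config.vocab_size)

This runs without errors, so nothing seems broken at inference time, but I would still like to understand why the two sizes differ.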