Some input texts raise an `IndexError: index out of range in self`

#1
by matous-volf - opened

For instance, running

from transformers import DistilBertForSequenceClassification, RobertaTokenizer

model = DistilBertForSequenceClassification.from_pretrained("cajcodes/DistilBERT-PoliticalBias")
tokenizer = RobertaTokenizer.from_pretrained("cajcodes/DistilBERT-PoliticalBias")

sample_text = "The Justice Department is facing mounting criticism after officials said they 've turned up no evidence that would warrant criminal charges in the IRS targeting scandal, with conservatives now calling the investigation a sham."

inputs = tokenizer(sample_text, return_tensors="pt", max_length=512, truncation=True, padding=True)
outputs = model(**inputs)

raises an

IndexError: index out of range in self
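For context, this is the generic error PyTorch's nn.Embedding raises when a lookup index is greater than or equal to num_embeddings. A minimal sketch (the sizes here just mirror this model's config):

import torch

# An embedding table with DistilBERT's vocab size; looking up an ID from
# outside the table raises the same error as above.
emb = torch.nn.Embedding(num_embeddings=30522, embedding_dim=768)
emb(torch.tensor([31026]))  # IndexError: index out of range in self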

Interestingly, changing the last word ("sham") to "shame" makes the code run. Setting max_length to something low (e.g. 32) also works, presumably because truncation drops the problematic word(s). The error is also raised when sample_text is just "conservatives", even though the full text above contains that word and runs fine (with "shame" at the end). I suspect this is because RoBERTa's byte-level BPE encodes a word differently with and without a leading space, so the standalone word maps to a different token ID than the mid-sentence one, as the snippet below shows.
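A quick way to check this (a sketch reusing the model and tokenizer loaded above):

# RoBERTa's byte-level BPE folds a leading space into the token, so the
# same word can map to different IDs depending on its position in the text.
for text in ["a sham", "a shame", "conservatives", " conservatives"]:
    ids = tokenizer(text)["input_ids"]
    out_of_range = [i for i in ids if i >= model.config.vocab_size]
    print(f"{text!r}: {ids} (out of range: {out_of_range})")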

The cause of the error is probably a mismatch between the vocabularies of the tokenizer (RoBERTa) and the model (DistilBERT). The token IDs for "a sham" are 0, 102, 31026, 2, but DistilBERT's vocab size is only 30522, so the third ID is out of range for the model's embedding matrix. The same applies to "conservatives".
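The mismatch is easy to confirm by comparing the two vocabulary sizes (the 50265 below assumes the repo ships the stock roberta-base tokenizer files):

print(len(tokenizer))           # 50265: tokens the RoBERTa tokenizer can emit
print(model.config.vocab_size)  # 30522: rows in DistilBERT's embedding matrix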

For now, a workaround is to replace every token ID greater than or equal to the model's vocab size with the <unk> token, which has ID 3 in the RoBERTa vocabulary:

print(inputs["input_ids"])                                   # contains IDs >= 30522
print(model.config.vocab_size)                               # 30522
print(tokenizer.unk_token)                                   # <unk>
print(tokenizer.convert_tokens_to_ids(tokenizer.unk_token))  # 3

unk_token_id = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
# Mask out every ID the embedding matrix cannot handle.
inputs["input_ids"][inputs["input_ids"] >= model.config.vocab_size] = unk_token_id
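With the out-of-range IDs mapped to <unk>, the forward pass completes (of course, the model loses whatever information those words carried):

import torch

outputs = model(**inputs)  # no longer raises
probs = torch.softmax(outputs.logits, dim=-1)
print(probs)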

Out of curiosity (I'm new to ML), why did you train the model with a different tokenizer than DistilBertTokenizer?
