Issue running inference
Hello, thanks for releasing this wonderful work! I'm having trouble running inference with this model.
Specifically, when I run the model on the tweet-eval dataset, I get the error `IndexError: index out of range in self`. Have you encountered this error before?
This doesn't happen with other models (including the ToxiGen RoBERTa model), so I don't think it's a preprocessing issue.
I get the same error. The tokenizer has a vocab size of 50257, but the classifier's embedding table may be smaller?
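For future readers, the error above is consistent with a tokenizer/model vocabulary mismatch. Here is a minimal sketch (plain Python, no Transformers dependency; the model vocab size is an illustrative assumption) of why an oversized token id triggers `IndexError: index out of range in self` inside the embedding lookup:

```python
# An embedding layer is essentially a lookup table indexed by token id.
# If the tokenizer can emit ids up to 50256 (a 50257-entry vocab, as
# reported above) but the model's embedding table has only 30522 rows
# (bert-base-uncased's vocab size, assumed here for the classifier),
# any id >= 30522 raises IndexError -- the same failure torch.nn.Embedding
# reports as "index out of range in self".

TOKENIZER_VOCAB = 50257  # vocab size reported for the mismatched tokenizer
MODEL_VOCAB = 30522      # assumed embedding-table size of the classifier

def embed(table, token_ids):
    """Simulate an embedding lookup: fetch one row per token id."""
    return [table[i] for i in token_ids]

table = [[0.0]] * MODEL_VOCAB    # stand-in for the embedding matrix

embed(table, [101, 7592, 102])   # ids within the model's vocab: fine

try:
    embed(table, [50256])        # id valid for the tokenizer, not the model
except IndexError:
    print("IndexError: token id exceeds the model's embedding table")
```

Using a tokenizer whose vocabulary matches the model's embedding table (here `bert-base-uncased`) keeps every id in range, which is why switching tokenizers makes the exception disappear.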
Hi there, please try using the bert-base-uncased tokenizer and let me know if that solves your problem!
Ok I'll update the README to use that. Something went wrong with the toxigen_hatebert tokenizer.
@tomh actually... it seems to make the exception go away, but I think the results are then incorrect:
```python
In [123]: toxigen_hatebert = pipeline("text-classification", model="tomh/toxigen_hatebert", tokenizer="bert-base-cased")

In [124]: toxigen_hatebert("hello")
Out[124]: [{'label': 'LABEL_0', 'score': 0.7423402667045593}]

In [125]: toxigen_hatebert("die you scum")
Out[125]: [{'label': 'LABEL_0', 'score': 0.9332824945449829}]
```
My fault, it should be

```python
toxigen_hatebert = pipeline("text-classification", model="tomh/toxigen_hatebert", tokenizer="bert-base-uncased")
```

as you said. Sorry for the spam.
For future readers, see https://github.com/microsoft/TOXIGEN/issues/8
Interesting... let me dig in a bit more. I just checked Hugging Face's hosted API for toxigen_hatebert and it indeed behaves differently.
Good catch!
Aha, so switching the tokenizer worked?