Error when loading tokenizer

#1
by danielschnell - opened

I get the following error when the tokenizer is loaded:

Traceback (most recent call last):
  File "/Users/dschnell/test/embeddings.py", line 66, in <module>
    model, tokenizer = load_model_and_tokenizer(args.model)
  File "/Users/dschnell/test/embeddings.py", line 7, in load_model_and_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 702, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 155, in __init__
    super().__init__(
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: expected value at line 1 column 1

This is with Python 3.10

MaCoCu org

Do you need Python 3.10 in your project? I just tried it with 3.9.13 and it works.

Probably I could go with 3.9 as well. I just wonder why the above error occurs at all. I have successfully loaded 3 other models, including xlm-roberta-base, and never had this problem with Python 3.10.

MaCoCu org

Yes, that's weird, thanks for bringing it up. I really don't have a good hypothesis as to why this would be the case. If you, by any chance, figure it out, please let us know. I might look into it later.

MaCoCu org

With Python 3.10.5 and a freshly installed virtual environment, this code works for me:

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained('MaCoCu/XLMR-MaCoCu-is', use_fast=True)
t.tokenize('Yes, this is a test')

Could you please post a piece of your code that triggers the error?

Here is the gist: https://gist.github.com/lumpidu/f9d068146564f9aea94e42ed2c04f68d
Here is the list of installed packages: https://gist.github.com/lumpidu/2de7636408148f6d39475d1074592b7d

The error happens right in line 8. I have now also installed it on a Linux machine with Python 3.8.10. Same error.

But if I execute your code snippet from above, it works ... hmm, it seems my git clone of your repo mixed something up. I will investigate.

MaCoCu org

That error usually happens when you try to parse something as JSON that is not JSON at all. Note that, unlike other models, this model has its tokenizer.json file tracked by git LFS because it's bigger than 10MB. I just did a git clone and the file is just a pointer to the git LFS object; you might need to do something else to clone it completely.

$ cat tokenizer.json
version https://git-lfs.github.com/spec/v1
oid sha256:62c24cdc13d4c9952d63718d6c9fa4c287974249e16b7ade6d5a85e7bbb75626
size 17082660
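
A quick way to check for this before loading the tokenizer is sketched below. It is only a minimal sketch: the function names `is_lfs_pointer` and `check_tokenizer_file` are made up for illustration and are not part of transformers. It assumes the pointer file starts with the same `version https://git-lfs.github.com/spec/v1` line shown above.

```python
import json
from pathlib import Path

# First line of a Git LFS pointer file, as shown in the `cat` output above.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def check_tokenizer_file(path):
    """Classify a file as 'lfs-pointer', 'invalid-json', or 'ok' (hypothetical helper)."""
    if is_lfs_pointer(path):
        return "lfs-pointer"
    try:
        # A real tokenizer.json must at least parse as JSON.
        json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return "invalid-json"
    return "ok"
```

If `check_tokenizer_file("tokenizer.json")` returns `"lfs-pointer"`, the clone only contains the pointer. With Git LFS installed, running `git lfs pull` inside the clone should fetch the real file.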

Yep, that was the problem: I needed to pull it. Now it works. Thanks for your help!

Btw. (hijacking my own thread ...) I have some more questions:

  • How long did it take to fine-tune the model, and with how much GPU power?
  • Are you aware of the other Icelandic text resources available at https://clarin.is/en/resources?

Notably from a size POV:

  • Icelandic Gigaword Corpus (IGC, 8.2 GB)
  • Icelandic Common Crawl Corpus (4.9 GB)

How does the corpus you used relate to these?

Thanks for the link. Closing the issue.

danielschnell changed discussion status to closed
