problem loading tokenizer

#4
by matatonic - opened

I'm seeing the following issue:

>>> from transformers import AutoTokenizer
>>> model_id = 'echo840/Monkey'
>>> tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 288/288 [00:00<00:00, 2.87MB/s]
tokenization_qwen.py: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 21.3k/21.3k [00:00<00:00, 84.0MB/s]
A new version of the following files was downloaded from https://huggingface.co/echo840/Monkey:
- tokenization_qwen.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
qwen.tiktoken: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.56M/2.56M [00:00<00:00, 30.2MB/s]
special_tokens_map.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 35.0/35.0 [00:00<00:00, 333kB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mashton/.local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 829, in from_pretrained
    return tokenizer_class.from_pretrained(
  File "/home/mashton/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/mashton/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "hf_home/modules/transformers_modules/echo840/Monkey/e12c9762d453211a1f3d8f5545b3bbfd70d4d1b7/tokenization_qwen.py", line 114, in __init__
    super().__init__(**kwargs)
  File "/home/mashton/.local/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "hf_home/modules/transformers_modules/echo840/Monkey/e12c9762d453211a1f3d8f5545b3bbfd70d4d1b7/tokenization_qwen.py", line 217, in _add_tokens
    if surface_form not in SPECIAL_TOKENS + self.IMAGE_ST:
AttributeError: 'QWenTokenizer' object has no attribute 'IMAGE_ST'

It seems super().__init__(**kwargs) ends up calling _add_tokens() before self.IMAGE_ST is set.
I tried this with transformers 4.39.2 and 4.40.0.dev0, with the same result.
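
For reference, a minimal sketch of the kind of reordering that avoids the error (not the actual Monkey source; the tag names are placeholders):

from transformers import PreTrainedTokenizer

class QWenTokenizer(PreTrainedTokenizer):
    def __init__(self, image_start_tag='<img>', image_end_tag='</img>', **kwargs):
        # Set IMAGE_ST *before* calling super().__init__(): per the traceback,
        # the base PreTrainedTokenizer.__init__() calls self._add_tokens(),
        # which reads self.IMAGE_ST.
        self.IMAGE_ST = (image_start_tag, image_end_tag)
        super().__init__(**kwargs)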

matatonic changed discussion title from problem loading tokrnizer to problem loading tokenizer

Hello, you should either use transformers==4.32.0 or refer to this link for a fix: https://huggingface.co/echo840/Monkey-Chat/discussions/1.

Based on the copyright, I don't think I can share my fix even if I write one. I'm trying to add support for Monkey to another project, but I'm not sure I can redistribute a modified (fixed) version of Monkey. Would you consider fixing it upstream? The workaround seems simple enough and should cause no harm to existing users.
transformers keeps moving forward, so I can't pin transformers==4.32.0 for my project, and if I can't share a fix... what can others do?

Thank you! I have resolved the issue. Please give it another try, and if you have any questions, please let me know.
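
One way to make sure the cached tokenization_qwen.py is actually refreshed is to pass force_download=True (a standard from_pretrained argument); a minimal check:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('echo840/Monkey', trust_remote_code=True, force_download=True)
>>> tokenizer('hello')['input_ids']  # should now tokenize without the AttributeError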

It's working now, thank you! I've added support for it to my project: https://github.com/matatonic/openedai-vision
Congratulations, it works very well, and thanks again!
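
As the download warning above notes, pinning a revision also avoids surprise updates to the remote code file; revision accepts a branch name, tag, or commit hash (the value below is a placeholder, not the actual fixed commit):

>>> tokenizer = AutoTokenizer.from_pretrained(
...     'echo840/Monkey',
...     trust_remote_code=True,
...     revision='main',  # replace with the commit hash of a known-good revision
... )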

matatonic changed discussion status to closed

That sounds great! Thank you for your contribution.
