Any plans to convert tokenizer into a Fast Tokenizer class?
Currently the tokenizer can only be loaded and saved in the legacy format. Trying to save it in the fast tokenizer format:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "aisingapore/sealion7b", trust_remote_code=True
)
tokenizer.save_pretrained(".", legacy_format=False)
```

will result in the following error:
```
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
  File "/home/jason/projects/test/test.py", line 6, in <module>
    tokenizer.save_pretrained(".", legacy_format=False)
  File "/home/jason/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2182, in save_pretrained
    save_files = self._save_pretrained(
  File "/home/jason/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2214, in _save_pretrained
    raise ValueError(
ValueError: Only fast tokenizers (instances of PreTrainedTokenizerFast) can be saved in non legacy format.
```
A fast tokenizer is used by default by the Rust web server in TGI, and it will fail to load if there is no fast tokenizer implementation for the model's tokenizer.
There is a conversion script for turning slow tokenizer classes into fast ones, but since this is a newly defined tokenizer class rather than an existing one, I am unable to convert it into a fast tokenizer.
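As a sketch of why the stock conversion path doesn't apply here: `transformers` keeps a registry of slow-to-fast converters keyed by tokenizer class name, and a custom class loaded via `trust_remote_code` has no entry in it (the class name `SEABPETokenizer` below is a placeholder, not necessarily the real SEA-LION class name):

```python
from transformers.convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS

# Built-in slow tokenizer classes have registered converters...
print("GPT2Tokenizer" in SLOW_TO_FAST_CONVERTERS)    # True
# ...but a custom remote-code class does not
# ("SEABPETokenizer" is a hypothetical name here):
print("SEABPETokenizer" in SLOW_TO_FAST_CONVERTERS)  # False
```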
Hi @jahhs0n,
Thank you for checking out SEA-LION!
The SEA-LION tokenizer is trained using the SentencePiece package, hence it is not compatible with the fast tokenizer format by default.
I've uploaded a version of the fast tokenizer, which we converted using the `sentencepiece_extractor` script from the `tokenizers` package. You can find it in the `fasttokenizer` branch of the SEA-LION GitHub repository, in the `tokenizer` folder.
However, please note that due to the conversion process, it is not possible to replicate the exact SentencePiece model, as mentioned in this GitHub issue: https://github.com/huggingface/tokenizers/issues/225#issuecomment-612140650.
We will be adding some examples of where the original SentencePiece model and the fast tokenizer differ to the README over the next few days. Hope this helps.
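To illustrate the kind of mismatch the linked issue describes: BPE output depends on the order of the merge rules, which the extraction script can only approximate from the vocabulary. A minimal sketch with a toy vocabulary (not the SEA-LION vocabulary) shows that the same vocab with different merge orders tokenizes differently:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Same vocabulary, different merge orders -> different tokenizations.
vocab = {"a": 0, "b": 1, "c": 2, "ab": 3, "bc": 4}
t1 = Tokenizer(BPE(vocab, [("a", "b")]))
t2 = Tokenizer(BPE(vocab, [("b", "c")]))
print(t1.encode("abc").tokens)  # ['ab', 'c']
print(t2.encode("abc").tokens)  # ['a', 'bc']
```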
Thanks for the work! May I know how exactly the conversion is done? My understanding of the process is as follows:
- extract the `vocab.json` and `merges` files using `sentencepiece_extractor.py` from Hugging Face's `tokenizers` package
- load the `vocab.json` and `merges` files into the `SentencePieceBPETokenizer` class using the `from_file` method
- save the `SentencePieceBPETokenizer` object using the `save` method
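The steps above can be sketched as follows. A toy in-memory vocab/merges pair stands in for the real `vocab.json` and `merges` files produced by `sentencepiece_extractor.py`, so the snippet runs on its own:

```python
from tokenizers import SentencePieceBPETokenizer

# Toy vocab/merges standing in for the files from step 1
# (sentencepiece_extractor.py output).
vocab = {"<unk>": 0, "▁": 1, "h": 2, "i": 3, "hi": 4, "▁hi": 5}
merges = [("h", "i"), ("▁", "hi")]

# Step 2: load vocab and merges into SentencePieceBPETokenizer.
# With the real extracted files this would be:
#   SentencePieceBPETokenizer.from_file("vocab.json", "merges.txt")
tok = SentencePieceBPETokenizer(vocab, merges)
print(tok.encode("hi").tokens)  # ['▁hi']

# Step 3: save the combined tokenizer as a single tokenizer.json.
tok.save("tokenizer.json")
```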
Please correct my understanding if I get any part of that process wrong. Thank you for your help!
Hi @jahhs0n,
Yes, your understanding is pretty much spot on; there are also a few additional steps:
- Add special tokens
- Manually add `auto_map` in `tokenizer_config.json` so that `AutoTokenizer` recognises the custom tokenizer class
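For reference, the `auto_map` entry maps `AutoTokenizer` to a `[slow_class, fast_class]` pair of dotted paths into the tokenizer module shipped with the repo. The module and class names below are placeholders, not the actual SEA-LION ones:

```json
{
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_custom.CustomBPETokenizer",
      "tokenization_custom.CustomBPETokenizerFast"
    ]
  }
}
```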
I have also uploaded the notebook that performs the conversion for your reference here:
https://github.com/aisingapore/sealion/blob/fasttokenizer/tokenizer/fast_tokenizer_conversion.ipynb
Hope this helps.
Thanks for the guidance! The notebook is especially helpful for seeing how the conversion is done. Will be closing this issue now.
Hi @jahhs0n,
I'm glad the notebook is helpful for you.
Unfortunately, I made a mistake and uploaded the wrong tokenizer files for the SEA-LION tokenizer.
I've now replaced them with the correct files and have double-checked them on my side. Kindly check out the latest files below. My apologies for the mistake.
https://github.com/aisingapore/sealion/tree/fasttokenizer/tokenizer/sealion_fasttokenizer
Thank you!