Tiktoken and interaction with Transformers
Support for tiktoken model files is seamlessly integrated in 🤗 transformers: when a model is loaded with from_pretrained and its checkpoint on the Hub contains a tokenizer.model file in the tiktoken format, the file is automatically converted into our
fast tokenizer.
Known models that were released with a tiktoken tokenizer.model file:
- gpt2
- llama3
Example usage
To load tiktoken files in transformers, ensure that the tokenizer.model file is a tiktoken file; it will then be picked up automatically by from_pretrained. Here is how one would load a tokenizer and a model, both of which
can be loaded from the exact same file:
```py
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")
```
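Because the tokenizer and the model weights live in the same repository, the model can be loaded from the same checkpoint. A minimal sketch, assuming you have access to the gated checkpoint and enough memory for the 8B weights (the prompt is illustrative):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# The tiktoken tokenizer.model in the "original" subfolder is converted
# into a fast tokenizer automatically
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")
print(tokenizer.is_fast)  # True: the tiktoken file was converted on load

# The model weights come from the same repository
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The tiktoken integration", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```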
Create tiktoken tokenizer
The tokenizer.model file contains no information about additional tokens or pattern strings. If these are important, convert the tokenizer to tokenizer.json, the appropriate format for PreTrainedTokenizerFast.
Create the tiktoken encoding with tiktoken.get_encoding and then convert it to tokenizer.json with convert_tiktoken_to_fast.
```py
from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding

# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")
```

The resulting tokenizer.json file is saved to the specified directory and can be loaded with PreTrainedTokenizerFast.
```py
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")
```
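As a quick check, the converted tokenizer should produce the same token ids as the original tiktoken encoding for ordinary text, assuming the conversion preserved the vocabulary. A sketch (the sample text is illustrative):

```py
from tiktoken import get_encoding
from transformers import PreTrainedTokenizerFast

encoding = get_encoding("gpt2")
tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")

text = "Hello, tiktoken!"
# Both should yield the same ids if the conversion preserved the vocabulary
print(encoding.encode(text))
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))  # round-trips back to the text
```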