Space in chat_template

#60 · opened by ThangLD201

Hi @patrickvonplaten, I noticed that with Mistral-7B-Instruct-v0.3 the chat_template (in tokenizer_config.json) is:

'[INST] ' + message['content'] + '[/INST]'

but with v0.2 it was

'[INST] ' + message['content'] + ' [/INST]'

So the space before [/INST] is missing in v0.3. Is this intentional or simply a typo?
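
For reference, the difference can be reproduced by rendering the templates with transformers (a minimal sketch, not from the thread; it assumes the Hub repos mistralai/Mistral-7B-Instruct-v0.2 and -v0.3 are accessible):

# Compare the rendered chat templates of v0.2 and v0.3
from transformers import AutoTokenizer

messages = [{"role": "user", "content": "Hi, how are you?"}]

for version in ("v0.2", "v0.3"):
    tok = AutoTokenizer.from_pretrained(f"mistralai/Mistral-7B-Instruct-{version}")
    # tokenize=False returns the raw prompt string, so the spacing around [/INST] is visible
    print(version, repr(tok.apply_chat_template(messages, tokenize=False)))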

It's intentional. The original Mistral model does not have the second space:

Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
>>> from mistral_common.protocol.instruct.messages import UserMessage
>>> from mistral_common.protocol.instruct.request import ChatCompletionRequest
>>> query = "Hi, how are you?"
>>> tokenizer = MistralTokenizer.from_file(f"/path/to/tokenizer.model.v3")
>>> completion_request = ChatCompletionRequest(
...     messages=[UserMessage(content=query)]
... )
>>> encoded = tokenizer.encode_chat_completion(completion_request)
>>> encoded
Tokenized(tokens=[1, 3, 16127, 29493, 1678, 1228, 1136, 29572, 4], text='<s>[INST]▁Hi,▁how▁are▁you?[/INST]', prefix_ids=None)
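
Note that in the rendered text above there is no ▁ (space) token between the message and [/INST], which matches the v0.3 template.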