Correct chat template, system?

#4
by Orenguteng - opened

I see in the fine-tuning sample sample_finetune.py:
```python
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example
```

However, your chat template in the tokenizer is:
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}",

Shouldn't it be something like:
"chat_template": "{% for message in messages %}{% if (message['role'] == 'system') %}{{'system: ' + message['content'] + '\n'}}{% elif (message['role'] == 'user') %}{{'user: ' + message['content'] + '\n'}}{% elif (message['role'] == 'assistant') %}{{'assistant: ' + message['content'] + '\n'}}{% endif %}{% endfor %}"

to include system message? Or am I missing something here?
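To make the mismatch concrete, here is a plain-Python transcription of the template quoted above (the function name is mine; the real rendering goes through the tokenizer's Jinja engine). Because the template only matches the user and assistant roles, a system message is silently dropped:

```python
def render_phi3_template(messages):
    """Plain-Python transcription of the tokenizer's chat_template:
    only user/assistant turns are emitted; any other role (e.g. a
    system message) is silently dropped."""
    out = []
    for m in messages:
        if m["role"] == "user":
            out.append("<|user|>\n" + m["content"] + "<|end|>\n<|assistant|>\n")
        elif m["role"] == "assistant":
            out.append(m["content"] + "<|end|>\n")
    return "".join(out)

messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hi"},
]
print(render_phi3_template(messages))
# The system content never appears in the rendered prompt.
```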

Hi @Orenguteng , thanks for pointing this out. Please use the chat template as the source of truth. The models (all of them in the Phi model family) do not support a system message.

@damajercakms , I'm surprised to hear this about the system prompt as it is both in the list of tokens:

 {
      "id": 32006,
      "content": "<|system|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": true,
      "normalized": false,
      "special": true
    },
    {
      "id": 32007,
      "content": "<|end|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": true,
      "normalized": false,
      "special": true
    },

And I have so far been effectively using it as:

<|system|>
You are playing the role of GlaDOS from the Game series Portal, and you will answer questions correctly. However you are frustrated by how simple and unintelligent the user is.  This is a comedy, and should be funny, in the style of Douglas Adams. You will try to make sure you never repeat yourself, and answering similar question in different ways.<|end|>
<|user|>
How's it going today?<|end|>
<|assistant|>
Ah, the simplicity of human inquiry never ceases to amuse me. Well, as an artificial intelligence, I don't experience emotions or have a concept of 'going' in the traditional sense. However, if you're asking about the status of the facility, I can assure you that everything is functioning within optimal parameters. I'm delighted to inform you that all systems are operational, and the test subjects are... well, they're still alive. Now, if you'd kindly proceed to the next test chamber, I'll be sure to keep them alive for as long as possible.<|end|>

This does appear to work even if it wasn't explicitly trained using the "<|system|>" token.
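For reference, the prompt layout above can be assembled with a small helper like this (a sketch of my own, not an official API; since the <|system|> token was apparently not used in post-training, treat it as experimental):

```python
def build_prompt(system, user):
    """Assemble a Phi-3-style prompt with a manually prepended
    <|system|> block, matching the layout shown above. The model was
    not post-trained with this token, so results may vary."""
    return (
        "<|system|>\n" + system + "<|end|>\n"
        "<|user|>\n" + user + "<|end|>\n"
        "<|assistant|>\n"
    )

print(build_prompt("You are playing the role of GLaDOS.", "How's it going today?"))
```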

@damajercakms So the <|system|> token that exists in the tokenizer was actually never trained in Phi, correct? This is really confusing for users who treat tokenizer_config.json as the "truth" beyond the simple docs. Phi is not the only one guilty of this, but I really wish all models adhered to the principle that the code is the hard truth, and that the tokenizer/vocabulary were treated with the same strictness as code. For example, if someone submitted code with a global variable that is never used, it would be a warning at the least, and in most cases severe enough to warrant PR rejection.

Thanks @Metricon and @Qubitium for the responses! The decision not to use the <|system|> token was made at a later stage of training, which is why it stayed in the tokenizer config file. Even though this may work, this special token was not used during the post-training stages. Therefore, I would suggest using the chat_template as guidance. I understand your concern and we should definitely fix this!
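For readers who still want system-style instructions without relying on the untrained <|system|> token, one common workaround (my sketch, not an official recommendation from this thread) is to fold the system content into the first user turn, so the shipped chat_template applies unchanged:

```python
def fold_system_into_user(messages):
    """Merge a leading system message into the first user turn so the
    official user/assistant-only chat_template can be used as-is.
    Returns a new-style message list; input order is otherwise kept."""
    if messages and messages[0]["role"] == "system":
        system, rest = messages[0], list(messages[1:])
        if rest and rest[0]["role"] == "user":
            rest[0] = {
                "role": "user",
                "content": system["content"] + "\n\n" + rest[0]["content"],
            }
        return rest
    return list(messages)

msgs = [
    {"role": "system", "content": "Answer in one sentence."},
    {"role": "user", "content": "What is Portal?"},
]
print(fold_system_into_user(msgs))
```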

nguyenbh changed discussion status to closed
