Fine-tuning data format (Alpaca or chat)
Hi,
I'm a little confused about how to properly format our custom data for fine-tuning this model on a specific task. I am familiar with the Alpaca format (instruction, input, response) and the Mistral format (where the prompt is surrounded by [INST] and [/INST] tokens).
Does anybody know what to use for this model? Should it be the following style:
```
<start_of_turn>user
please write a hello world program<end_of_turn>
<start_of_turn>model
```
I assume "model" after the second <start_of_turn> token marks the model's answer/response?
Thanks a lot!
Hi, the format should be as described in the instructions:
```
<start_of_turn>user
please write a hello world program<end_of_turn>
<start_of_turn>model
```
Note the use of the <start_of_turn> and <end_of_turn> tokens (angle brackets included) and the exact newlines; the model is sensitive to this spacing, so match it for best performance.
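For a complete fine-tuning example (prompt plus target response), the fully rendered string would look like the sketch below. This is only a sketch derived from the chat template; the trailing <end_of_turn> after the model turn follows from the template, but double-check the exact output of your tokenizer:

```python
# One fully rendered training example in the Gemma chat format.
# Note: the tokenizer usually prepends <bos> itself, so don't add it
# manually if you tokenize with add_special_tokens=True (an assumption
# worth verifying against your tokenizer's settings).
example = (
    "<start_of_turn>user\n"
    "please write a hello world program<end_of_turn>\n"
    "<start_of_turn>model\n"
    'print("Hello, world!")<end_of_turn>\n'
)
```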
"chat_template": "{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif
%}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{
raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if
(message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '
' + role + '\n' + message['content'] | trim + '\n' }}{% endfor %}{% if add_generation_prompt
%}{{'model\n'}}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "",
"legacy": null,
"model_max_length": 1000000000000000019884624838656,
"pad_token": "",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"tokenizer_class": "GemmaTokenizer",
"unk_token": "",
"use_default_system_prompt": false
}
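Note the constraints the template enforces: no system role, strictly alternating user/assistant turns starting with user, and the assistant role is rendered as model. A conversation you pass in should therefore look like this (the contents are just placeholders):

```python
# Roles must alternate user/assistant, starting with user; a "system"
# message would raise an exception. "assistant" is rendered as "model".
chat = [
    {"role": "user", "content": "please write a hello world program"},
    {"role": "assistant", "content": 'print("Hello, world!")'},
    {"role": "user", "content": "now in C, please"},
]
```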
Rather than composing the prompt string by hand, you can let the tokenizer build it in the required format:

```python
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```

With add_generation_prompt=True, the template appends the trailing <start_of_turn>model\n so the model knows it should produce the next turn.
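If your data is in Alpaca format, converting it to the Gemma chat format is straightforward. A minimal sketch, assuming the google/gemma-7b-it checkpoint (use whichever variant you are fine-tuning) and the (instruction, input, response) field names from your dataset:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")  # assumed checkpoint

def alpaca_to_gemma(record: dict) -> str:
    """Render one Alpaca record (instruction, input, response) as a Gemma training string."""
    user_text = record["instruction"]
    if record.get("input"):  # the optional "input" field is appended to the instruction
        user_text += "\n\n" + record["input"]
    chat = [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": record["response"]},
    ]
    # tokenize=False returns the rendered string; the chat template maps
    # "assistant" to "model" and inserts the turn tokens for us. No
    # add_generation_prompt here, since the target response is already included.
    return tokenizer.apply_chat_template(chat, tokenize=False)

print(alpaca_to_gemma({
    "instruction": "please write a hello world program",
    "input": "",
    "response": 'print("Hello, world!")',
}))
```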
Thanks a lot
Yes, indeed, hope this helped!