Special token (</s>) not generated by the model.generate() method

#47
by Pradeep1995

I fine-tuned the mistralai/Mistral-7B-Instruct-v0.2 model on a dataset in the following format:

sentence1</s>sentence2
sentence3</s>sentence4

After tuning, I run inference by prompting with the first part only (i.e. sentence1 or sentence3), so I expect a response of the form </s>sentence2 or </s>sentence4.

But the fine-tuned model produces only sentence2 or sentence4, without generating the </s> special token.

How do I change the code so that model.generate() produces the </s> token?

Hi @Pradeep1995
How do you verify that </s> is not generated? Can you make sure you decode all tokens with skip_special_tokens=False?
Also, it is possible that the model does not attend to these tokens during training; could you inspect the attention mask in your training setup and make sure the </s> token is correctly attended to?
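
For instance, a minimal sketch of both checks (the sample string is illustrative, and the generate call is only sketched in the comments):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# 1. Decode without dropping special tokens, so </s> shows up if it was generated:
# output_tokens = model.generate(**inputs)
# print(tokenizer.decode(output_tokens[0], skip_special_tokens=False))

# 2. Inspect how a training sample is tokenized: the attention mask should be 1
#    at the position of the </s> token (id == tokenizer.eos_token_id).
enc = tokenizer("sentence1</s>sentence2")
for tok_id, mask in zip(enc["input_ids"], enc["attention_mask"]):
    print(tok_id, tokenizer.convert_ids_to_tokens(tok_id), mask)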

@ybelkada
Before training, I initialized the tokenizer as follows:

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", trust_remote_code=True,use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

I hadn't set anything like skip_special_tokens=False/True on the tokenizer before training.

Also, after training, I tried decoding the inference output both ways with the same tokenizer as above:

tokenizer.decode(output_tokens,skip_special_tokens=True)
and
tokenizer.decode(output_tokens,skip_special_tokens=False)

But the model is not generating the special tokens, so they never appear in the decoded output either way.

Is my method correct?
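
A direct way to check whether </s> was generated at all, assuming output_tokens is the 1-D tensor of ids returned by model.generate(), is to look at the raw token ids rather than the decoded string:

# If eos_token_id never appears in the raw output, </s> really was not
# generated, regardless of any decoding flags.
eos_id = tokenizer.eos_token_id  # id of </s> for this tokenizer
print(eos_id, eos_id in output_tokens.tolist())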

Mistral AI_ org

It's more of a training issue: did you add the EOS and BOS tokens to every prompt? Also, setting eos = pad for training seems wrong: pad positions are masked out of the loss, so you will always ignore the EOS token, but you need the model to pay attention to it when you train.
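
As an illustration of that point, a minimal sketch of using a pad token distinct from </s> (the [PAD] variant is a hypothetical example and requires resizing the embeddings):

from transformers import AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Option 1: reuse the unk token for padding; it is distinct from </s>,
# so masking pad positions no longer masks the EOS token.
tokenizer.pad_token = tokenizer.unk_token

# Option 2: add a dedicated pad token (then resize the model embeddings):
# tokenizer.add_special_tokens({"pad_token": "[PAD]"})
# base_model.resize_token_embeddings(len(tokenizer))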

sentence1</s>sentence2
sentence3</s>sentence4
.....
.....

This is the format of my training data. I didn't explicitly add anything like EOS or BOS to the training data, other than the </s> in the middle of each sample.
What I want is for the model to generate the special token (</s>) during inference, in the middle of the sentence rather than at the end.
How can I modify the code for that? Please share a snippet if possible.
@ybelkada

I see. I think by default the DataCollatorForLanguageModeling masks out the EOS token during training (since your pad token is the EOS token). Can you share your training snippet?
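
For context, a minimal sketch of how to verify this on one sample, using the pad = eos setup under discussion:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", use_fast=False)
tokenizer.pad_token = tokenizer.eos_token  # the setup under discussion

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator([tokenizer("sentence1</s>sentence2")])

# Because pad_token_id == eos_token_id, the collator sets the label at every
# </s> position to -100, so the loss never teaches the model to emit </s>.
for tok_id, label in zip(batch["input_ids"][0].tolist(), batch["labels"][0].tolist()):
    print(tokenizer.convert_ids_to_tokens(tok_id), label)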

import torch
from peft import LoraConfig, AutoPeftModelForCausalLM, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

peft_model = .....
peft_config = .....
training_arguments = ....

# Dataset format: sentence1</s>sentence2, sentence3</s>sentence4, ...
data = dataset

# SFTTrainer
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=data,
    peft_config=peft_config,
    dataset_text_field="prompt",
    max_seq_length=3000,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
trainer.train()

@ybelkada please check

@Pradeep1995
Thanks!
Is your dataset already formatted as sentence1</s>sentence2, sentence3</s>sentence4, ...etc.? If that's the case, you need to set packing=False. The other solution is to separate the sentences with a different token than </s>, since that token is already used as the EOS token.
Does this issue also help: https://github.com/huggingface/trl/issues/1283 ?
Let me know how it goes!
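
A minimal sketch of both fixes, reusing the names from the snippet above (the <sep> token is a hypothetical example):

# Fix 1: keep </s> as the separator but disable packing, so each sample keeps
# its own boundaries instead of being concatenated with its neighbours.
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=data,
    peft_config=peft_config,
    dataset_text_field="prompt",
    max_seq_length=3000,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

# Fix 2: use a dedicated separator token instead of </s>:
# tokenizer.add_special_tokens({"additional_special_tokens": ["<sep>"]})
# base_model.resize_token_embeddings(len(tokenizer))
# ...and format the data as sentence1<sep>sentence2 instead of sentence1</s>sentence2.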
