EOS and PAD tokens
The special_tokens_map.json specifies the eos and pad tokens as # and " respectively, which seems like a weird choice.
{
"eos_token": "#",
"pad_token": "\"",
"unk_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
Is this correct? Has the model been trained on these token maps? Has the model seen the <|endoftext|> token during training?
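For reference, here is a quick way to check what the tokenizer actually resolves these entries to (the model id below is a placeholder for the checkpoint in question):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-id")  # placeholder for the checkpoint in question
# Inspect the special tokens and the vocab ids they map to
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)
# Check whether <|endoftext|> is in the vocab or only resolves via the unk entry
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))
print(tokenizer.unk_token, tokenizer.unk_token_id)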
I'm also seeing this. I don't know how it would affect downstream use, and I also don't see a template.
In my experience, after fine-tuning this leads to the model preferring single quotes over double quotes, because the pad token choice really confuses DataCollatorForLanguageModeling.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("model-id")  # placeholder for the checkpoint in question
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
features = tokenizer('I said "Hi"', return_tensors="pt")
collator([features])
produces:
{'input_ids': tensor([[[ 40, 531, 220, 1, 17250, 1]]]), 'attention_mask': tensor([[[1, 1, 1, 1, 1, 1]]]), 'labels': tensor([[[ 40, 531, 220, -100, 17250, -100]]])}
Since the pad token is the double-quote character (id 1 here), the collator masks every occurrence of it to -100 in the labels, so the model never learns to output a double quote.
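If the masking is the problem, one possible workaround (a sketch, assuming you are willing to resize the embeddings; the <|pad|> token name is made up) is to register a dedicated pad token before building the collator:

from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("model-id")  # placeholder
model = AutoModelForCausalLM.from_pretrained("model-id")  # placeholder

# Register a dedicated pad token so '"' is no longer treated as padding
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})  # hypothetical token name
model.resize_token_embeddings(len(tokenizer))

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

With a distinct pad token, the collator only masks real padding positions, and the double-quote token keeps contributing to the loss.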
"also I don't see a template"
It's not a chat model.
I have the same question. It's very strange.