Suggested tokenizer changes by Unsloth.ai
Oh hey @gugarosa :) Hopefully these fixes are all correct! Also wrote up a blog post about it here: https://unsloth.ai/blog/phi4
If you need any help, ask away!
What I don't understand is the <|dummy_87|> choice for the padding token. What is the purpose of the <|dummy_x|> special tokens?
Thanks @danielhanchen. The blog post was really helpful. We are just running some extra tests to ensure that no capability is lost, but everything is looking good so far.
@dkleine Since we padded the vocabulary size to a multiple of 64 (for better performance), we had to add a set of unused tokens to it, which ended up being called "dummy" tokens. These tokens were not used during pre-training or fine-tuning, but they can later be replaced and used to encode new functionality into the model, for example, a <|im_retrieve|> token or something else.
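To make the padding arithmetic concrete, here is a minimal sketch of rounding a vocabulary size up to a multiple of 64 and naming the filler entries. The helper names, the example size of 100,000, and the zero-based <|dummy_i|> numbering are assumptions for illustration only; they are not taken from the actual Phi-4 tokenizer build.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple (hypothetical helper)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


def dummy_token_names(vocab_size: int, multiple: int = 64) -> list:
    """Name the unused filler slots created by the padding.

    The "<|dummy_i|>" numbering scheme here is an assumption.
    """
    padded = pad_vocab_size(vocab_size, multiple)
    return [f"<|dummy_{i}|>" for i in range(padded - vocab_size)]


# Illustrative size only, not the real Phi-4 vocabulary size.
real_tokens = 100_000
print(pad_vocab_size(real_tokens))          # padded size
print(len(dummy_token_names(real_tokens)))  # number of filler tokens
```

A size that is already a multiple of 64 is left unchanged, so no dummy tokens would be created in that case.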
The <|dummy_87|> was purely arbitrary, probably because it was the last token in the vocabulary. It could have been any other dummy token. Even more, it could also be replaced by another string, let's say, <|im_pad|>, since it has never been used before.
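As a rough sketch of why the rename is harmless: the token string changes, but the ID (and therefore the embedding row the model looks up) stays the same. The toy vocabulary, its IDs, and the rename_token helper below are hypothetical and do not use any real tokenizer API.

```python
def rename_token(vocab: dict, old: str, new: str) -> dict:
    """Return a copy of vocab with `old` renamed to `new`, keeping its ID.

    Hypothetical helper operating on a plain {token: id} mapping.
    """
    if old not in vocab:
        raise KeyError(f"{old!r} is not in the vocabulary")
    if new in vocab:
        raise ValueError(f"{new!r} already exists in the vocabulary")
    return {(new if tok == old else tok): idx for tok, idx in vocab.items()}


# Toy vocabulary with made-up IDs.
toy_vocab = {"<|im_end|>": 0, "<|endoftext|>": 1, "<|dummy_87|>": 2}
renamed = rename_token(toy_vocab, "<|dummy_87|>", "<|im_pad|>")
print(renamed["<|im_pad|>"])  # same ID the dummy token had
```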
Was the model trained using the same token for both bos and eos? If so, how can modifying one now not disrupt the model's performance, given these tokens define sequence boundaries and altering them could cause premature stopping, incoherent generation, misaligned embeddings, and degraded task performance? @danielhanchen mentioned better metrics for unsloth/phi-4 but do they capture premature stopping or incoherent generation etc?
@jonatbullpointdotorg that's one of the points we are ablating with the proposed change (and why we are taking some time before merging it). As far as I know, BOS has not been used and the model was fine-tuned using <|im_end|>. However, it does have some pre-training information with <|endoftext|>, which could lead to the issues you mentioned.