
In the tokenizer of the moss-moon-003-base model, the eos token is <|endoftext|>. When training an SFT model, this token must be set to the <eom> token instead.
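A minimal sketch of that override, assuming a Hugging Face `transformers`-style tokenizer object that exposes `eos_token`, `unk_token_id`, and `convert_tokens_to_ids`, and assuming `<eom>` is already present in the vocabulary. The helper name `use_eom_as_eos` is hypothetical and not part of the MOSS code.

```python
def use_eom_as_eos(tokenizer, eom_token="<eom>"):
    """Point the tokenizer's eos token at <eom> for SFT (hypothetical helper)."""
    # Assumption: <eom> already exists in the vocabulary. If the lookup falls
    # back to the unknown-token id, fail loudly instead of training on it.
    eom_id = tokenizer.convert_tokens_to_ids(eom_token)
    if eom_id is None or eom_id == getattr(tokenizer, "unk_token_id", None):
        raise ValueError(f"{eom_token!r} is not in the tokenizer vocabulary")
    tokenizer.eos_token = eom_token
    return eom_id
```

Checking the id before assigning avoids silently ending every training sample with the unknown token when the special token is missing.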

SFT stage

  • <eoh>: end of human
  • <eot>: end of thoughts
  • <eoc>: end of commands
  • <eom>: end of moss
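To illustrate how these markers delimit a sample, here is a minimal sketch that splits a string into segments at each end marker. The sample text and the helper name `split_segments` are hypothetical; the actual MOSS prompt template may contain role prefixes or further markers not listed here.

```python
import re

# End markers listed above, mapped to a label for the segment each one closes.
END_MARKERS = {
    "<eoh>": "human",
    "<eot>": "thoughts",
    "<eoc>": "commands",
    "<eom>": "moss",
}

_MARKER_RE = re.compile("(" + "|".join(re.escape(m) for m in END_MARKERS) + ")")


def split_segments(text):
    """Return (label, content) pairs, one per marker-terminated segment."""
    # With one capture group, re.split alternates: content, marker, content, ...
    parts = _MARKER_RE.split(text)
    segments = []
    for content, marker in zip(parts[0::2], parts[1::2]):
        segments.append((END_MARKERS[marker], content.strip()))
    return segments


# Hypothetical sample, for illustration only.
sample = "Hi<eoh>\nNo tool needed<eot>\nNone<eoc>\nHello!<eom>"
print(split_segments(sample))
```

Any text after the last marker is dropped, which matches the idea that `<eom>` ends the model's turn.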

Note

The moss tokenizer's implementation:

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        text = "".join(tokens)
        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
        return text
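The method above depends on a `byte_decoder` table. The sketch below assumes that table is the inverse of the GPT-2 style byte-to-unicode mapping, and that `errors` is `"replace"`; neither is shown in the source, so treat both as assumptions. It is a standalone re-creation, not the MOSS code, and shows why tokens should be joined before decoding: a multi-byte UTF-8 character can be split across tokens.

```python
def bytes_to_unicode():
    """GPT-2 style table: map each byte 0..255 to a printable unicode character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes without a printable form are shifted above U+00FF.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def convert_tokens_to_string(tokens, errors="replace"):
    """Join token strings, map each character back to its byte, decode as UTF-8."""
    text = "".join(tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors=errors)


# Hypothetical token split: the first token holds only 2 of the 3 bytes of "你".
encoded = "".join(byte_encoder[b] for b in "你好".encode("utf-8"))
tokens = [encoded[:2], encoded[2:]]

print(convert_tokens_to_string(tokens))      # all tokens joined, then decoded
print(convert_tokens_to_string(tokens[:1]))  # a partial character on its own
```

Decoding the first token alone yields a replacement character because its bytes do not form a complete UTF-8 sequence, while decoding the joined tokens recovers the original text.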

troubleshooting