In the tokenizer of the moss-moon-003-base model, the `eos token` is `<|endoftext|>`. When training the SFT model, the EOS token must be set to the `<eom>` token instead.


## SFT stage

- `<eoh>`: end of human
- `<eot>`: end of thoughts
- `<eoc>`: end of commands
- `<eom>`: end of moss
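
As an illustration of how these end markers delimit the parts of a turn, here is a minimal sketch in plain Python. The helper names (`split_segments`, `truncate_at_eom`) and the sample string are hypothetical and not part of the MOSS codebase; the exact prompt layout used in SFT is not given here, so the sketch only assumes that each segment is closed by one of the four markers listed above.

```python
import re

# End markers from the list above, mapped to the segment they close.
SFT_END_MARKERS = {
    "<eoh>": "human",
    "<eot>": "thoughts",
    "<eoc>": "commands",
    "<eom>": "moss",
}

# Matches any one of the four markers literally.
_MARKER_RE = re.compile("|".join(re.escape(m) for m in SFT_END_MARKERS))


def split_segments(text):
    """Split text into (segment_name, content) pairs, one per end marker."""
    segments = []
    start = 0
    for match in _MARKER_RE.finditer(text):
        name = SFT_END_MARKERS[match.group()]
        segments.append((name, text[start:match.start()].strip()))
        start = match.end()
    return segments


def truncate_at_eom(text):
    """Keep only the text before the first <eom>, which acts as EOS in SFT."""
    return text.split("<eom>", 1)[0]


# Hypothetical sample turn, for illustration only.
sample = "Hi<eoh>thinking<eot>None<eoc>Hello!<eom>"
segments = split_segments(sample)
```

Since `<eom>` plays the role of the end-of-sequence token during SFT, anything generated after it can be discarded, which is all `truncate_at_eom` does.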



## Notes

The MOSS tokenizer's `convert_tokens_to_string`:

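```py
def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) in a single string."""
    # Join the byte-level token strings into one string.
    text = "".join(tokens)
    # Map each character back to its raw byte, then decode the bytes as UTF-8.
    text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
    return text
```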
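
To make the byte-level decoding step concrete, the following is a self-contained sketch. It assumes the tokenizer uses the GPT-2-style byte-to-unicode table, which the `byte_decoder` lookup above suggests but which is not confirmed here. The split of "模型" into two tokens is hypothetical, chosen so that a token boundary falls in the middle of a multi-byte character.

```python
def bytes_to_unicode():
    """GPT-2-style table mapping each of the 256 byte values to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            # Bytes without a printable form are shifted to code points >= 256.
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


BYTE_ENCODER = bytes_to_unicode()
BYTE_DECODER = {ch: b for b, ch in BYTE_ENCODER.items()}


def to_byte_level(text):
    """Encode text as UTF-8 and map every byte to its table character."""
    return "".join(BYTE_ENCODER[b] for b in text.encode("utf-8"))


def convert_tokens_to_string(tokens, errors="replace"):
    """Standalone version of the method above: join tokens, map back to bytes, decode."""
    text = "".join(tokens)
    return bytearray(BYTE_DECODER[c] for c in text).decode("utf-8", errors=errors)


encoded = to_byte_level("模型")          # 6 bytes, so 6 table characters
tokens = [encoded[:2], encoded[2:]]      # hypothetical split inside the first character
whole = convert_tokens_to_string(tokens)
partial = convert_tokens_to_string(tokens[:1])
```

Decoding all tokens together recovers the original text, while decoding the first token alone yields the replacement character U+FFFD, because its bytes are an incomplete UTF-8 sequence. This is worth keeping in mind when decoding token by token, for example in streaming output.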