This is the Llama 3.1 tokenizer with what is essentially the minimal Olmo 2 chat template, expressed with Llama 3's special tokens (cf. Llama 3's chat template).
The goal is mostly to have a "base model" tokenizer (one that uses Llama's `<|end_of_text|>` token instead of `<|eot_id|>`) for hybrid pretraining.
Specifically, we use the following chat template:

```jinja
{{ bos_token }}
{% for message in messages -%}
<|start_header_id|>{{ message['role'] }}<|end_header_id|>
{{ message['content'] | trim }}<|eot_id|>
{%- endfor %}
{% if add_generation_prompt -%}
<|start_header_id|>assistant<|end_header_id|>
{% endif %}
```
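As a minimal sketch of how this template renders with `transformers`, the repo id below is a placeholder (substitute the actual tokenizer repo):

```python
from transformers import AutoTokenizer

# Hypothetical repo id -- substitute the actual tokenizer repo.
tokenizer = AutoTokenizer.from_pretrained("your-org/marin-tokenizer")

messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thanks!"},
    {"role": "user", "content": "That's good to hear!"},
]

# Render the conversation as a string using the template above.
# add_generation_prompt=True would instead append the trailing assistant
# header (the {% if add_generation_prompt %} branch) for inference.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```

The `print` output should match the `marin` example in the comparisons below.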
Comparisons:

olmo2:
```
<|endoftext|><|user|>
Hello, how are you?
<|assistant|>
I'm doing well, thanks!<|endoftext|>
<|user|>
That's good to hear!
```

llama3_instruct:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello, how are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I'm doing well, thanks!<|eot_id|><|start_header_id|>user<|end_header_id|>
That's good to hear!<|eot_id|>
```

marin:
```
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
Hello, how are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I'm doing well, thanks!<|eot_id|><|start_header_id|>user<|end_header_id|>
That's good to hear!<|eot_id|>
```
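If the description above holds, the "base model" behavior can be checked directly: the tokenizer's `eos_token` should be `<|end_of_text|>` rather than `<|eot_id|>`, even though the chat template marks message boundaries with `<|eot_id|>`. A quick sketch (again with a placeholder repo id):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-org/marin-tokenizer")  # hypothetical id

# Assumption based on the note above: eos is Llama's base-model
# <|end_of_text|> token, not the instruct-style <|eot_id|>.
print(tok.eos_token)                            # expected: <|end_of_text|>
print(tok.convert_tokens_to_ids("<|eot_id|>"))  # <|eot_id|> still exists as a special token
```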