Is the code for building the tokenizer open sourced?

#3
by Akirami - opened

I want to know how the tokenizer was built and if possible the whole training process

I'm almost certain it's a variation of o200k tokenization

I thought they built it on the Llama tokenizer

Sarvam AI org

It is a vanilla SentencePiece tokenizer trained on a subset of our training data. No fancy stuff.

Cool. Thanks for letting me know

Akirami changed discussion status to closed
