This tokenizer is specifically designed for working with Réunion Creole, a language primarily spoken on the island of Réunion. It is based on the Byte Pair Encoding (BPE) model and optimized for the lexical and orthographic specificities of the language.
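To make the BPE idea concrete, here is a minimal pure-Python sketch of how merge rules can be learned and then applied. It is an illustration only: the toy corpus, the function names `learn_bpe` and `apply_bpe`, and the tie-breaking rule are assumptions made for this example, not the procedure actually used to train this tokenizer.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Start from characters: each word is a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus (hypothetical): frequent words end up as single tokens.
corpus = ["zot", "zot", "zot", "tout", "lé"]
merges = learn_bpe(corpus, num_merges=2)
print(merges)
print(apply_bpe("zot", merges))
print(apply_bpe("tout", merges))
```

Frequent words collapse into a single symbol after a few merges, while rarer words stay split into smaller pieces. This is why a BPE vocabulary trained on Réunion Creole text can fit the language's spelling conventions better than a vocabulary trained on other languages.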
Special tokens:
[CLS]: Start-of-sequence token for classification tasks.
[SEP]: Separator token for multi-segment inputs.
[PAD]: Padding token.
[MASK]: Masking token used for training masked language models.
[UNK]: Token for unknown words.

You can easily load this tokenizer using the transformers library:
from transformers import AutoTokenizer

# Load the tokenizer files from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

# Example of tokenization: encode() returns a list of integer token IDs
text = "Comment i lé zot tout ?"
tokens = tokenizer.encode(text)
print(tokens)
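The special tokens listed above are typically arranged in a BERT-style layout. The following pure-Python sketch shows that layout under stated assumptions: the helpers `build_pair` and `mask_token` are hypothetical, the word-level split stands in for real subword output, and the source does not say whether this tokenizer inserts [CLS] and [SEP] automatically.

```python
def build_pair(tokens_a, tokens_b, max_len):
    """Lay out two segments as [CLS] A [SEP] B [SEP], then pad to max_len.

    Returns the padded token list and an attention mask
    (1 for real tokens, 0 for padding).
    """
    seq = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    pad = max_len - len(seq)
    attention_mask = [1] * len(seq) + [0] * pad
    return seq + ["[PAD]"] * pad, attention_mask


def mask_token(tokens, position):
    """Return a copy of tokens with one position replaced by [MASK]."""
    masked = list(tokens)
    masked[position] = "[MASK]"
    return masked


# Hypothetical word-level pieces; the real tokenizer would produce BPE subwords.
segment_a = ["Comment", "i", "lé"]
segment_b = ["zot", "tout", "?"]

padded, attention_mask = build_pair(segment_a, segment_b, max_len=10)
print(padded)
print(attention_mask)
print(mask_token(padded, 2))
```

To see what the actual tokenizer does, compare the IDs returned by `tokenizer.encode(text)` with `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.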
Hugo How-Choong