Tokenizer for Réunion Creole 🇷🇪

This tokenizer is specifically designed for working with Réunion Creole, a language primarily spoken on the island of Réunion. It is based on Byte Pair Encoding (BPE) and tuned to the lexical and orthographic characteristics of the language.

Features

  • Built using the BPE (Byte Pair Encoding) model.
  • Trained on "LA RIME, Mo i akorde dann bal zakor", a free-access book.
  • Supports special tokens for common NLP tasks:
    • [CLS]: Start-of-sequence token for classification tasks.
    • [SEP]: Separator token for multi-segment inputs.
    • [PAD]: Padding token.
    • [MASK]: Masking token used for training masked language models.
    • [UNK]: Token for unknown words.
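A BPE tokenizer like this one is built by repeatedly merging the most frequent adjacent symbol pair in the training corpus. The following is a minimal sketch of that learning loop; the toy corpus and merge count are illustrative only, not the actual training data or settings used for this tokenizer:

```python
import re
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, words):
    """Replace each occurrence of the pair (matched as whole symbols) with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in words.items()}

# Toy corpus: words pre-split into characters, mapped to their frequencies.
corpus = {"z a k o r": 5, "z a k": 2, "a k o r d e": 3}
merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(pair, corpus)

print(merges)  # merges are learned from most to least frequent pair
```

Running enough iterations grows frequent subwords (here `akor`) into single vocabulary entries, which is what lets the tokenizer represent common Creole word pieces compactly.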

Usage

Loading the Tokenizer

You can easily load this tokenizer using the transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

# Example of tokenization
text = "Comment i lé zot tout ?"
tokens = tokenizer.encode(text)
print(tokens)

Hugo How-Choong
