metadata
license: mit
datasets:
  - eligapris/kirundi-english
language:
  - rn
library_name: transformers

eligapris/rn-tokenizer

Model Description

This repository contains a BPE tokenizer trained specifically for the Kirundi language (ISO 639-1: rn, ISO 639-3: run).

Tokenizer Details

  • Type: BPE (Byte-Pair Encoding)
  • Vocabulary Size: 30,000 tokens
  • Special Tokens: [UNK], [CLS], [SEP], [PAD], [MASK]
  • Pre-tokenization: Whitespace-based
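
To make the BPE idea above concrete, here is a minimal pure-Python sketch of how merge rules can be learned and applied. It is a toy illustration, not the procedure used to train this tokenizer: the helper names (`learn_bpe_merges`, `merge_pair`, `apply_merges`) are hypothetical, `str.split()` stands in for the whitespace pre-tokenizer, and the tie-breaking rule is an arbitrary choice made only so the result is deterministic.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of pre-tokenized words."""
    # Every word starts out as a sequence of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically largest pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus; str.split() is a simplified stand-in for whitespace pre-tokenization.
corpus = "ifumbire mvaruganda aba azi gukora kandi afite ubushobozi"
merges = learn_bpe_merges(corpus.split(), num_merges=10)
print(merges)
print(apply_merges("gukora", merges))
```

A real vocabulary of 30,000 tokens comes from running far more merges over the full corpus; the special tokens listed above are reserved entries and are not learned by merging.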

Intended Uses & Limitations

Intended Uses

  • Text processing for the Kirundi language
  • Pre-processing for NLP tasks involving Kirundi
  • Foundation for developing Kirundi language applications

Limitations

  • The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects
  • Limited to the vocabulary observed in the training data
  • Performance may vary on domain-specific text
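
One rough way to check how these limitations affect your own text is to measure how often the tokenizer falls back to `[UNK]` and how many tokens it produces per word. The sketch below works on plain lists of token strings, so it does not depend on any library; the function names (`unk_rate`, `tokens_per_word`) are hypothetical, and the sample token split is made up for illustration rather than taken from this tokenizer's output.

```python
def unk_rate(tokens, unk_token="[UNK]"):
    """Fraction of tokens equal to the unknown token (0.0 for an empty list)."""
    return tokens.count(unk_token) / len(tokens) if tokens else 0.0


def tokens_per_word(text, tokens):
    """Average number of tokens per whitespace-separated word in `text`."""
    words = text.split()
    return len(tokens) / len(words) if words else 0.0


# Hypothetical segmentation, written by hand for illustration only.
sample_text = "ifumbire mvaruganda"
sample_tokens = ["ifu", "mbire", "mva", "ruganda"]

print(unk_rate(sample_tokens))
print(tokens_per_word(sample_text, sample_tokens))
```

With the real tokenizer loaded, the token list could come from `tokenizer.tokenize(text)`. A high `[UNK]` rate or a token-per-word ratio well above what you see on general text suggests the domain is poorly covered.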

Training Data

The tokenizer was trained on the Kirundi-English parallel corpus:

  • Dataset: eligapris/kirundi-english
  • Size: 21.4k sentence pairs
  • Nature: Parallel corpus with Kirundi and English translations
  • Domain: Mixed domain including religious, general, and conversational text

Installation

To use this tokenizer in your project, first install the required dependency:

pip install transformers

Then load the tokenizer directly from the Hugging Face Hub:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")

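Or, if you have downloaded the tokenizer files locally, load tokenizer.json directly. This constructor does not infer special tokens from the file, so pass them explicitly (padding fails without a pad token):

from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json", unk_token="[UNK]", cls_token="[CLS]",
    sep_token="[SEP]", pad_token="[PAD]", mask_token="[MASK]")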

Usage Examples

Loading and Using the Tokenizer

You can load the tokenizer in two ways:

# Method 1: Using AutoTokenizer (recommended)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")

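# Method 2: Using PreTrainedTokenizerFast with a local file
# Special tokens are not inferred from tokenizer.json, so pass them explicitly
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json", unk_token="[UNK]", cls_token="[CLS]",
    sep_token="[SEP]", pad_token="[PAD]", mask_token="[MASK]")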

Basic Usage Examples

  1. Tokenize a single sentence:
# Basic tokenization
text = "ab'umudugudu hafi ya bose bateranira kumva ijambo ry'Imana."
encoded = tokenizer(text)
print(f"Input IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'])}")
  2. Batch tokenization:
# Process multiple sentences at once
texts = [
    "ifumbire mvaruganda.",
    "aba azi gukora kandi afite ubushobozi"
]
encoded = tokenizer(texts, padding=True, truncation=True)
print("Batch encoding:", encoded)
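  3. Get token IDs with special tokens:
# add_special_tokens=True is the default; tokens such as [CLS] and [SEP]
# appear only if the tokenizer's post-processor is configured to add them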
encoded = tokenizer(text, add_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
print(f"Tokens with special tokens: {tokens}")
  4. Decode tokenized text:
# Convert token IDs back to text
ids = encoded['input_ids']
decoded_text = tokenizer.decode(ids)
print(f"Decoded text: {decoded_text}")
  5. Padding and truncation:
# Pad or truncate sequences to a specific length
encoded = tokenizer(
    texts,
    padding='max_length',
    max_length=32,
    truncation=True,
    return_tensors='pt'  # Return PyTorch tensors (requires torch to be installed)
)
print("Padded sequences:", encoded['input_ids'].shape)

Future Development

This tokenizer is intended to serve as a foundation for future Kirundi language model development, including potential fine-tuning with techniques like LoRA (Low-Rank Adaptation).

Technical Specifications

Software Requirements

dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0"
}
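
A quick way to confirm that an environment meets these minimums is to compare installed versions against them. The following is a minimal sketch using only the standard library; `MINIMUMS`, `parse_version`, and `check_versions` are hypothetical names, and the version parsing is deliberately simplistic (it keeps the first three numeric components and ignores pre-release suffixes).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements listed above.
MINIMUMS = {"transformers": "4.30.0", "tokenizers": "0.13.0"}


def parse_version(text):
    """Turn a version string into a 3-tuple of ints, padding with zeros."""
    numbers = [int(part) for part in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def check_versions(minimums):
    """Return a dict mapping each package name to 'ok', 'missing', or 'too old (<version>)'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


print(check_versions(MINIMUMS))
```

For strict comparisons that handle pre-releases correctly, a dedicated version-parsing library would be a better fit than this simplified parser.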

Contact

eligapris


Updates and Versions

  • v1.0.0 (Initial Release)
    • Base tokenizer implementation
    • Trained on Kirundi-English parallel corpus
    • Basic functionality and documentation

Acknowledgments

  • Dataset provided by eligapris
  • Hugging Face's Transformers and Tokenizers libraries