---
license: mit
datasets:
- eligapris/kirundi-english
language:
- rn
library_name: transformers
---

# eligapris/rn-tokenizer

## Model Description

This repository contains a BPE tokenizer trained specifically for the Kirundi language (ISO 639-3: run).

### Tokenizer Details

- **Type**: BPE (Byte-Pair Encoding)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
- **Pre-tokenization**: Whitespace-based

## Intended Uses & Limitations

### Intended Uses

- Text processing for the Kirundi language
- Pre-processing for NLP tasks involving Kirundi
- Foundation for developing Kirundi language applications

### Limitations

- Trained on a specific corpus, so it may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text

## Training Data

The tokenizer was trained on the Kirundi-English parallel corpus:

- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed, including religious, general, and conversational text

## Installation

Install the required dependency:

```bash
pip install transformers
```

Then load the tokenizer directly from the Hugging Face Hub:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
```

Or, if you have downloaded the tokenizer files locally:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```

## Usage Examples

### Loading and Using the Tokenizer

You can load the tokenizer in two ways:

```python
# Method 1: Using AutoTokenizer (recommended)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")

# Method 2: Using PreTrainedTokenizerFast with a local file
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```

#### Basic Usage Examples

1. Tokenize a single sentence:

```python
# Basic tokenization
text = "ab'umudugudu hafi ya bose bateranira kumva ijambo ry'Imana."
encoded = tokenizer(text)
print(f"Input IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'])}")
```

2. Batch tokenization:

```python
# Process multiple sentences at once
texts = [
    "ifumbire mvaruganda.",
    "aba azi gukora kandi afite ubushobozi"
]
encoded = tokenizer(texts, padding=True, truncation=True)
print("Batch encoding:", encoded)
```

3. Get token IDs with special tokens:

```python
# Add special tokens such as [CLS] and [SEP]
encoded = tokenizer(text, add_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
print(f"Tokens with special tokens: {tokens}")
```

4. Decode tokenized text:

```python
# Convert token IDs back to text
ids = encoded['input_ids']
decoded_text = tokenizer.decode(ids)
print(f"Decoded text: {decoded_text}")
```

5. Padding and truncation:

```python
# Pad or truncate sequences to a fixed length
encoded = tokenizer(
    texts,
    padding='max_length',
    max_length=32,
    truncation=True,
    return_tensors='pt'  # Return PyTorch tensors
)
print("Padded sequences:", encoded['input_ids'].shape)
```
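### Tokenizing a Dataset

Beyond individual calls, the same tokenizer can preprocess the full training corpus in one pass. The sketch below uses the `datasets` library (`pip install datasets`) to tokenize the Kirundi side of eligapris/kirundi-english; the column name `"rn"` is an assumption, so check the dataset card for the actual field names before running it.

```python
# Sketch: batch-tokenizing the eligapris/kirundi-english corpus.
# The "rn" column name is an assumption -- adjust to the dataset's schema.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
corpus = load_dataset("eligapris/kirundi-english")

def tokenize_batch(batch):
    # Tokenize the Kirundi side of each sentence pair
    return tokenizer(batch["rn"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize_batch, batched=True)
print(tokenized)
```

Mapping with `batched=True` tokenizes examples in chunks rather than one at a time, which is noticeably faster for a corpus of this size.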
## Future Development

This tokenizer is intended to serve as a foundation for future Kirundi language model development, including potential fine-tuning with techniques such as LoRA (Low-Rank Adaptation).

## Technical Specifications

### Software Requirements

```python
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0"
}
```

## Contact

eligapris

---

## Updates and Versions

- v1.0.0 (Initial Release)
  - Base tokenizer implementation
  - Trained on Kirundi-English parallel corpus
  - Basic functionality and documentation

## Acknowledgments

- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries