---
license: mit
datasets:
- eligapris/kirundi-english
language:
- rn
library_name: transformers
---
# eligapris/rn-tokenizer
## Model Description
This repository contains a BPE tokenizer trained specifically for the Kirundi language (ISO 639-1: rn; ISO 639-3: run).
### Tokenizer Details
- **Type**: BPE (Byte-Pair Encoding)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
- **Pre-tokenization**: Whitespace-based
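Once loaded (see Installation below), these details are easy to verify; a quick sanity check, assuming the tokenizer loads from the Hub as shown later in this card:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
print(tokenizer.vocab_size)          # expected: 30000
print(tokenizer.all_special_tokens)  # expected: [UNK], [CLS], [SEP], [PAD], [MASK]
```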
## Intended Uses & Limitations
### Intended Uses
- Text processing for the Kirundi language
- Pre-processing for NLP tasks involving Kirundi
- Foundation for developing Kirundi language applications
### Limitations
- The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text
## Training Data
The tokenizer was trained on the Kirundi-English parallel corpus:
- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed domain including religious, general, and conversational text
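If you want to inspect the corpus yourself, it can be loaded with the `datasets` library (a minimal sketch; the split and column names are assumptions, so check the dataset card):
```python
from datasets import load_dataset

# Load the parallel corpus the tokenizer was trained on.
dataset = load_dataset("eligapris/kirundi-english")
print(dataset)              # shows the available splits and columns
print(dataset["train"][0])  # assumes a "train" split exists
```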
## Installation
You can use this tokenizer in your project by first installing the required dependencies:
```bash
pip install transformers
```
Then load the tokenizer directly from the Hugging Face Hub:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
```
Or if you have downloaded the tokenizer files locally:
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```
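Note that loading from a raw `tokenizer.json` does not register the special tokens, so features like padding may not work out of the box; passing them explicitly should restore the expected behavior (a sketch, assuming the standard token strings listed above):
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)
```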
## Usage Examples
### Loading and Using the Tokenizer
Loading works exactly as shown in the Installation section above; the examples below assume `tokenizer` has already been created with `AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")`.
#### Basic Usage Examples
1. Tokenize a single sentence:
```python
# Basic tokenization
text = "ab'umudugudu hafi ya bose bateranira kumva ijambo ry'Imana."
encoded = tokenizer(text)
print(f"Input IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'])}")
```
2. Batch tokenization:
```python
# Process multiple sentences at once
texts = [
"ifumbire mvaruganda.",
"aba azi gukora kandi afite ubushobozi"
]
encoded = tokenizer(texts, padding=True, truncation=True)
print("Batch encoding:", encoded)
```
3. Get token IDs with special tokens:
```python
# Add special tokens like [CLS] and [SEP]
encoded = tokenizer(text, add_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
print(f"Tokens with special tokens: {tokens}")
```
4. Decode tokenized text:
```python
# Convert token IDs back to text
ids = encoded['input_ids']
decoded_text = tokenizer.decode(ids)
print(f"Decoded text: {decoded_text}")
```
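To drop markers like `[CLS]` and `[SEP]` from the decoded string, pass `skip_special_tokens=True`:
```python
# Decode while filtering out special tokens
decoded_text = tokenizer.decode(ids, skip_special_tokens=True)
print(f"Decoded without special tokens: {decoded_text}")
```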
5. Padding and truncation:
```python
# Pad or truncate sequences to a specific length
encoded = tokenizer(
texts,
padding='max_length',
max_length=32,
truncation=True,
return_tensors='pt' # Return PyTorch tensors
)
print("Padded sequences:", encoded['input_ids'].shape)
```
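Since this is a fast (Rust-backed) tokenizer, you can also request character offsets, which helps align tokens back to the original text (a small sketch):
```python
# Offset mapping is only available on fast tokenizers
encoded = tokenizer("ifumbire mvaruganda.", return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
# Pair each token with the (start, end) character span it covers;
# special tokens report the empty span (0, 0).
for token, span in zip(tokens, encoded['offset_mapping']):
    print(token, span)
```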
## Future Development
This tokenizer is intended to serve as a foundation for future Kirundi language model development, including potential fine-tuning with techniques like LoRA (Low-Rank Adaptation).
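As a rough illustration only, pairing this tokenizer with a future Kirundi base model and LoRA adapters might look like the sketch below. The base checkpoint and target module names are placeholder assumptions (no model ships with this repository), and the `peft` library is assumed to be installed:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")

# Placeholder base model; a real project would pick a checkpoint suited
# to Kirundi and resize its embeddings to this tokenizer's vocabulary.
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=8,                                # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["query", "value"],  # attention projections (model-specific)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```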
## Technical Specifications
### Software Requirements
```bash
pip install "transformers>=4.30.0" "tokenizers>=0.13.0"
```
## Contact
eligapris
---
## Updates and Versions
- v1.0.0 (Initial Release)
- Base tokenizer implementation
- Trained on Kirundi-English parallel corpus
- Basic functionality and documentation
## Acknowledgments
- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries