|
--- |
|
license: mit |
|
datasets: |
|
- eligapris/kirundi-english |
|
language: |
|
- rn |
|
library_name: transformers |
|
--- |
|
# eligapris/rn-tokenizer |
|
|
|
## Model Description |
|
|
|
This repository contains a BPE tokenizer trained specifically for the Kirundi language (ISO 639-1: rn, ISO 639-3: run).
|
|
|
### Tokenizer Details |
|
- **Type**: BPE (Byte-Pair Encoding) |
|
- **Vocabulary Size**: 30,000 tokens |
|
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK] |
|
- **Pre-tokenization**: Whitespace-based |
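
Once loaded, these details are easy to sanity-check; a minimal snippet (the expected values in the comments follow from the specifications above, though the reported token order may vary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")

# Confirm the vocabulary size and special tokens listed above
print(tokenizer.vocab_size)          # 30000
print(tokenizer.all_special_tokens)  # ['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]']
```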
|
|
|
## Intended Uses & Limitations |
|
|
|
### Intended Uses |
|
- Text processing for Kirundi language |
|
- Pre-processing for NLP tasks involving Kirundi |
|
- Foundation for developing Kirundi language applications |
|
|
|
### Limitations |
|
- The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects |
|
- Limited to the vocabulary observed in the training data |
|
- Performance may vary on domain-specific text |
|
|
|
## Training Data |
|
|
|
The tokenizer was trained on the Kirundi-English parallel corpus: |
|
- **Dataset**: eligapris/kirundi-english |
|
- **Size**: 21.4k sentence pairs |
|
- **Nature**: Parallel corpus with Kirundi and English translations |
|
- **Domain**: Mixed domain including religious, general, and conversational text |
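
The corpus can be inspected directly with the Hugging Face `datasets` library; a minimal sketch (the split name and field layout are assumptions, so check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Load the Kirundi-English parallel corpus from the Hub
ds = load_dataset("eligapris/kirundi-english")

print(ds)              # available splits and row counts
print(ds["train"][0])  # one sentence pair; field names depend on the dataset schema
```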
|
|
|
## Installation |
|
|
|
You can use this tokenizer in your project by first installing the required dependencies: |
|
|
|
```bash |
|
pip install transformers |
|
``` |
|
|
|
Then load the tokenizer as shown in the usage examples below.
|
|
|
## Usage Examples |
|
|
|
### Loading and Using the Tokenizer |
|
|
|
You can load the tokenizer in two ways: |
|
|
|
```python |
|
# Method 1: Using AutoTokenizer (recommended) |
|
from transformers import AutoTokenizer |
|
tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer") |
|
|
|
# Method 2: Using PreTrainedTokenizerFast with a local tokenizer.json;
# pass the special tokens explicitly, since the bare file does not set them
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)
|
``` |
|
|
|
#### Basic Usage Examples |
|
|
|
1. Tokenize a single sentence: |
|
```python |
|
# Basic tokenization |
|
text = "ab'umudugudu hafi ya bose bateranira kumva ijambo ry'Imana." |
|
encoded = tokenizer(text) |
|
print(f"Input IDs: {encoded['input_ids']}") |
|
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'])}") |
|
``` |
|
|
|
2. Batch tokenization: |
|
```python |
|
# Process multiple sentences at once |
|
texts = [ |
|
"ifumbire mvaruganda.", |
|
"aba azi gukora kandi afite ubushobozi" |
|
] |
|
encoded = tokenizer(texts, padding=True, truncation=True) |
|
print("Batch encoding:", encoded) |
|
``` |
|
|
|
3. Get token IDs with special tokens: |
|
```python |
|
# Add special tokens like [CLS] and [SEP] |
|
encoded = tokenizer(text, add_special_tokens=True) |
|
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids']) |
|
print(f"Tokens with special tokens: {tokens}") |
|
``` |
|
|
|
4. Decode tokenized text: |
|
```python |
|
# Convert token IDs back to text |
|
ids = encoded['input_ids'] |
|
decoded_text = tokenizer.decode(ids) |
|
print(f"Decoded text: {decoded_text}") |
|
``` |
|
|
|
5. Padding and truncation: |
|
```python |
|
# Pad or truncate sequences to a specific length |
|
encoded = tokenizer( |
|
texts, |
|
padding='max_length', |
|
max_length=32, |
|
truncation=True, |
|
return_tensors='pt' # Return PyTorch tensors |
|
) |
|
print("Padded sequences:", encoded['input_ids'].shape) |
|
``` |
|
|
|
## Future Development |
|
This tokenizer is intended to serve as a foundation for future Kirundi language model development, including potential fine-tuning with techniques like LoRA (Low-Rank Adaptation). |
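
As a rough illustration of that direction, the sketch below attaches this tokenizer to a pretrained encoder and wraps it in a LoRA adapter via the `peft` library. The backbone, target modules, and hyperparameters are placeholder assumptions, not a released setup:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")

# Placeholder backbone; resizing only adjusts the embedding matrix, so the
# embeddings would still need (re)training to match this tokenizer's vocabulary
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.resize_token_embeddings(len(tokenizer))

# Hypothetical LoRA configuration; rank and target modules would need tuning
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"], lora_dropout=0.05)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```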
|
|
|
## Technical Specifications |
|
|
|
### Software Requirements |
|
```python |
|
dependencies = { |
|
"transformers": ">=4.30.0", |
|
"tokenizers": ">=0.13.0" |
|
} |
|
``` |
|
|
|
|
|
## Contact |
|
|
|
eligapris
|
|
|
--- |
|
|
|
## Updates and Versions |
|
|
|
- v1.0.0 (Initial Release)
  - Base tokenizer implementation
  - Trained on Kirundi-English parallel corpus
  - Basic functionality and documentation
|
|
|
## Acknowledgments |
|
|
|
- Dataset provided by eligapris |
|
- Hugging Face's Transformers and Tokenizers libraries |