---
license: mit
datasets:
- eligapris/kirundi-english
language:
- rn
library_name: transformers
---
# eligapris/rn-tokenizer
## Model Description
This repository contains a BPE tokenizer trained specifically for the Kirundi language (ISO 639-1: rn; ISO 639-3: run).
### Tokenizer Details
- **Type**: BPE (Byte-Pair Encoding)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
- **Pre-tokenization**: Whitespace-based
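Once loaded (see Installation below), these details are easy to verify; a quick sanity check, assuming the tokenizer loads from the Hub as shown later in this card:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
print(tokenizer.vocab_size)          # expected: 30000
print(tokenizer.all_special_tokens)  # expected: [UNK], [CLS], [SEP], [PAD], [MASK]
```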
## Intended Uses & Limitations
### Intended Uses
- Text processing for the Kirundi language
- Pre-processing for NLP tasks involving Kirundi
- Foundation for developing Kirundi language applications
### Limitations
- The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text
## Training Data
The tokenizer was trained on the Kirundi-English parallel corpus:
- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed domain including religious, general, and conversational text
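If you want to inspect the corpus yourself, it can be loaded with the `datasets` library (a minimal sketch; the split and column names are assumptions, so check the dataset card):
```python
from datasets import load_dataset

# Load the parallel corpus the tokenizer was trained on.
dataset = load_dataset("eligapris/kirundi-english")
print(dataset)              # shows the available splits and columns
print(dataset["train"][0])  # assumes a "train" split exists
```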
## Installation
You can use this tokenizer in your project by first installing the required dependencies:
```bash
pip install transformers
```
Then load the tokenizer directly from the Hugging Face Hub:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
```
Or if you have downloaded the tokenizer files locally:
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```
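Note that loading from a raw `tokenizer.json` does not register the special tokens, so features like padding may not work out of the box; passing them explicitly should restore the expected behavior (a sketch, assuming the standard token strings listed above):
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)
```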
## Usage Examples
### Loading and Using the Tokenizer
Loading works exactly as shown in the Installation section above; the examples below assume `tokenizer` has already been created with `AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")`.
#### Basic Usage Examples
1. Tokenize a single sentence:
```python
# Basic tokenization
text = "ab'umudugudu hafi ya bose bateranira kumva ijambo ry'Imana."
encoded = tokenizer(text)
print(f"Input IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'])}")
```
2. Batch tokenization:
```python
# Process multiple sentences at once
texts = [
"ifumbire mvaruganda.",
"aba azi gukora kandi afite ubushobozi"
]
encoded = tokenizer(texts, padding=True, truncation=True)
print("Batch encoding:", encoded)
```
3. Get token IDs with special tokens:
```python
# Add special tokens like [CLS] and [SEP]
encoded = tokenizer(text, add_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
print(f"Tokens with special tokens: {tokens}")
```
4. Decode tokenized text:
```python
# Convert token IDs back to text
ids = encoded['input_ids']
decoded_text = tokenizer.decode(ids)
print(f"Decoded text: {decoded_text}")
```
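To drop markers like `[CLS]` and `[SEP]` from the decoded string, pass `skip_special_tokens=True`:
```python
# Decode while filtering out special tokens
decoded_text = tokenizer.decode(ids, skip_special_tokens=True)
print(f"Decoded without special tokens: {decoded_text}")
```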
5. Padding and truncation:
```python
# Pad or truncate sequences to a specific length
encoded = tokenizer(
texts,
padding='max_length',
max_length=32,
truncation=True,
return_tensors='pt' # Return PyTorch tensors
)
print("Padded sequences:", encoded['input_ids'].shape)
```
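Since this is a fast (Rust-backed) tokenizer, you can also request character offsets, which helps align tokens back to the original text (a small sketch):
```python
# Offset mapping is only available on fast tokenizers
encoded = tokenizer("ifumbire mvaruganda.", return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
# Pair each token with the (start, end) character span it covers;
# special tokens report the empty span (0, 0).
for token, span in zip(tokens, encoded['offset_mapping']):
    print(token, span)
```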
## Future Development
This tokenizer is intended to serve as a foundation for future Kirundi language model development, including potential fine-tuning with techniques like LoRA (Low-Rank Adaptation).
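As a rough illustration only, pairing this tokenizer with a future Kirundi base model and LoRA adapters might look like the sketch below. The base checkpoint and target module names are placeholder assumptions (no model ships with this repository), and the `peft` library is assumed to be installed:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")

# Placeholder base model; a real project would pick a checkpoint suited
# to Kirundi and resize its embeddings to this tokenizer's vocabulary.
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=8,                                # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["query", "value"],  # attention projections (model-specific)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```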
## Technical Specifications
### Software Requirements
```bash
pip install "transformers>=4.30.0" "tokenizers>=0.13.0"
```
## Contact
eligapris
---
## Updates and Versions
- v1.0.0 (Initial Release)
- Base tokenizer implementation
- Trained on Kirundi-English parallel corpus
- Basic functionality and documentation
## Acknowledgments
- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries