language: hi
tags:
- hindi
- tokenizer
- bpe
- subword
- text-processing
pipeline_tag: text2text-generation
inference: true
license: mit
Hindi Byte Pair Encoding (BPE) Tokenizer
A specialized BPE tokenizer for Hindi text that compresses efficiently (a ratio of 3.66 with a vocabulary of 4,477 tokens in the run reported under Training stats) while maintaining linguistic coherence.
Project Overview
This project implements a Byte Pair Encoding (BPE) tokenizer specifically designed for Hindi text. It features:
- Efficient trie-based tokenization
- Visualization of training progress
- Compression ratio optimization
- Support for large Hindi text datasets
- Hugging Face compatibility
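To make the idea concrete, a BPE training loop can be sketched as below. This is a minimal illustration, not the code in byte_pair_encoder.py: the name `train_bpe` is hypothetical, and the sketch assumes merging starts from individual Unicode code points rather than raw bytes.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to num_merges merge rules from text (hypothetical helper)."""
    tokens = list(text)  # assumption: start from Unicode code points
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens in the current sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would gain nothing
        merges.append((a, b))
        # Rewrite the sequence, replacing each (a, b) occurrence with a + b.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("नमस्ते नमस्ते नमस्ते", 50)
print(len(merges), len(tokens))
```

Each merge adds one entry to the vocabulary and shortens the token sequence, which is what drives the compression ratio up as training proceeds.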
Project Structure
hindi-bpe/
├── data/                  # Dataset directory
│   ├── train/             # Training data
│   └── valid/             # Validation data
├── tokenizer/             # Saved tokenizer files
│   ├── encoder.json       # Encoder state
│   └── vocab_stats.json   # Vocabulary statistics
├── output/                # Visualization outputs
├── byte_pair_encoder.py   # Core BPE implementation
├── hindi_bpe.py           # Hindi-specific wrapper
├── test_hindi_bpe.py      # Test suite
└── requirements.txt       # Dependencies
Training stats
At iteration 4500:
- Vocabulary size: 4,477
- Data size: 448,754
- Compression ratio: 3.66
- Max token length: 64
File Descriptions
byte_pair_encoder.py
- Core BPE implementation
- Trie-based tokenization
- Training statistics tracking
- Visualization utilities
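As an illustration of trie-based tokenization, the sketch below stores the vocabulary in a trie and greedily picks the longest matching token at each position. The names `TrieNode`, `build_trie`, and `tokenize`, the `unk_id` fallback, and the greedy longest-match strategy are all assumptions made for this example; the actual implementation in byte_pair_encoder.py may differ.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.token_id = None  # set when a vocabulary entry ends at this node

def build_trie(vocab):
    """Build a trie from a {token_string: token_id} mapping."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def tokenize(text, root, unk_id=-1):
    """Greedy longest-match tokenization; unknown characters map to unk_id."""
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_end = root, None, i
        j = i
        # Walk the trie as far as the text allows, remembering the last
        # position at which a complete vocabulary entry was seen.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, j
        if best_id is None:
            ids.append(unk_id)
            i += 1
        else:
            ids.append(best_id)
            i = best_end
    return ids

# Toy vocabulary, for illustration only.
vocab = {"न": 0, "म": 1, "नम": 2, "स": 3, "नमस": 4}
root = build_trie(vocab)
print(tokenize("नमसम", root))  # → [4, 1]
```

A trie keeps lookup cost proportional to the length of the matched token rather than to the size of the vocabulary.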
hindi_bpe.py
- Hindi-specific tokenizer wrapper
- Text preprocessing
- Model saving/loading
- Compression ratio calculation
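The preprocessing and compression-ratio steps might look roughly like the sketch below. Both functions are hypothetical: it assumes preprocessing means NFC normalization plus keeping only the Devanagari Unicode block (U+0900 to U+097F) and whitespace, and that the compression ratio is the input length in characters divided by the number of tokens. hindi_bpe.py may define either differently.

```python
import re
import unicodedata

def preprocess(text):
    """Normalize text and keep only Devanagari characters and spaces (assumed rules)."""
    text = unicodedata.normalize("NFC", text)
    # Replace anything outside the Devanagari block and whitespace with a space.
    text = re.sub(r"[^\u0900-\u097F\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

def compression_ratio(text, token_ids):
    """Characters per token (assumed definition of the ratio)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text) / len(token_ids)

print(preprocess("नमस्ते,  world!  भारत"))      # → नमस्ते भारत
print(compression_ratio("abcdefgh", [1, 2]))  # → 4.0
```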
app.py
- Interactive web interface
- Real-time tokenization
- Training visualization
- Model parameter tuning
test_hindi_bpe.py
- Test suite for tokenizer
- Performance benchmarks
- Example usage
Installation
Clone the repository and install the dependencies:
- git clone https://github.com/yourusername/hindi-bpe.git
- cd hindi-bpe
- pip install -r requirements.txt
Download and prepare dataset
- python download_dataset.py
Web Interface
- streamlit run app.py
Testing
- python test_hindi_bpe.py
The test suite includes:
- Training pipeline verification
- Compression ratio validation
- Token count requirements
- Encoding/decoding accuracy
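An encoding/decoding accuracy check of the kind listed above can be sketched as a round-trip test. The helper `check_round_trip` is hypothetical and takes the encode and decode functions as arguments, since the real tokenizer's method names are not shown here; the toy code-point tokenizer only stands in for it.

```python
def check_round_trip(encode, decode, samples):
    """Return the samples that do not survive encode -> decode unchanged."""
    failures = []
    for s in samples:
        if decode(encode(s)) != s:
            failures.append(s)
    return failures

# Stand-in tokenizer for illustration: one token per Unicode code point.
def toy_encode(text):
    return [ord(c) for c in text]

def toy_decode(ids):
    return "".join(chr(i) for i in ids)

samples = ["नमस्ते", "भारत एक देश है।"]
print(check_round_trip(toy_encode, toy_decode, samples))  # → []
```

An empty result means every sample was reproduced exactly; any returned string points at a lossy encode or decode step.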
Performance Metrics
The tokenizer aims to achieve:
- Vocabulary size < 5000 tokens
- Compression ratio ≥ 3.2
- Fast encoding/decoding
- Memory-efficient operation
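The two numeric targets above can be checked mechanically. The function below is a hypothetical helper, not part of the project; its default thresholds are taken from the list above and the example values from the Training stats section.

```python
def meets_targets(vocab_size, compression_ratio, max_vocab=5000, min_ratio=3.2):
    """True if the vocabulary is below max_vocab and the ratio is at least min_ratio."""
    return vocab_size < max_vocab and compression_ratio >= min_ratio

# Values reported at iteration 4500.
print(meets_targets(4477, 3.66))  # → True
```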
Contributing
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.