---
language: hi
tags:
- hindi
- tokenizer
- bpe
- subword
- text-processing
pipeline_tag: text2text-generation
inference: true
license: mit
---

# Hindi Byte Pair Encoding (BPE) Tokenizer

A specialized BPE tokenizer for Hindi text that achieves efficient compression while preserving linguistic coherence.

## Project Overview

This project implements a Byte Pair Encoding (BPE) tokenizer designed specifically for Hindi text. It features:

- Efficient trie-based tokenization (illustrated at the end of this README)
- Visualization of training progress
- Compression ratio optimization
- Support for large Hindi text datasets
- Hugging Face compatibility

## Project Structure

```
hindi-bpe/
├── data/                   # Dataset directory
│   ├── train/              # Training data
│   └── valid/              # Validation data
├── tokenizer/              # Saved tokenizer files
│   ├── encoder.json        # Encoder state
│   └── vocab_stats.json    # Vocabulary statistics
├── output/                 # Visualization outputs
├── app.py                  # Streamlit web interface
├── byte_pair_encoder.py    # Core BPE implementation
├── download_dataset.py     # Dataset download script
├── hindi_bpe.py            # Hindi-specific wrapper
├── test_hindi_bpe.py       # Test suite
└── requirements.txt        # Dependencies
```

## Training Stats

At iteration 4,500:

- Vocabulary size: 4,477
- Data size: 448,754
- Compression ratio: 3.66
- Max token length: 64

## File Descriptions

1. **byte_pair_encoder.py**
   - Core BPE implementation
   - Trie-based tokenization
   - Training statistics tracking
   - Visualization utilities

2. **hindi_bpe.py**
   - Hindi-specific tokenizer wrapper
   - Text preprocessing
   - Model saving/loading
   - Compression ratio calculation

3. **app.py**
   - Interactive web interface
   - Real-time tokenization
   - Training visualization
   - Model parameter tuning

4. **test_hindi_bpe.py**
   - Test suite for tokenizer
   - Performance benchmarks
   - Example usage

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/hindi-bpe.git
cd hindi-bpe

# Install dependencies
pip install -r requirements.txt

# Download and prepare the dataset
python download_dataset.py
```

## Usage

For programmatic use, see the example sketch at the end of this README.

### Web Interface

```bash
streamlit run app.py
```

### Tests

```bash
python test_hindi_bpe.py
```

The test suite includes:

- Training pipeline verification
- Compression ratio validation
- Token count requirements
- Encoding/decoding accuracy

## Performance Metrics

The tokenizer aims to achieve:

- Vocabulary size under 5,000 tokens
- Compression ratio ≥ 3.2
- Fast encoding/decoding
- Memory-efficient operation

## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

## License

This project is licensed under the MIT License; see the LICENSE file for details.
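
## Example Usage

A minimal sketch of programmatic use. The `HindiBPE` class name, its `vocab_size` argument, and the `train`/`encode`/`decode`/`save` methods are assumptions based on the description of `hindi_bpe.py` above; check the source for the actual API.

```python
# A minimal sketch, assuming hindi_bpe.py exposes a HindiBPE wrapper
# with train/encode/decode/save methods. Check the source for the
# actual names and signatures.
from hindi_bpe import HindiBPE

# Train on the prepared training split (see Installation above)
tokenizer = HindiBPE(vocab_size=5000)
tokenizer.train("data/train")

# Round-trip a Hindi sentence
text = "नमस्ते दुनिया"
ids = tokenizer.encode(text)
assert tokenizer.decode(ids) == text

# Compression ratio, taken here as input characters per token
print(f"compression ratio: {len(text) / len(ids):.2f}")

# Persist the trained state (written to tokenizer/encoder.json)
tokenizer.save("tokenizer/")
```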
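
### How Trie-Based Tokenization Works

The overview lists "efficient trie-based tokenization". The snippet below is a standalone illustration of the general technique (greedy longest-prefix matching against a trie of vocabulary tokens), not the code in `byte_pair_encoder.py`.

```python
# Standalone illustration of trie-based greedy longest-match
# tokenization; not the implementation in byte_pair_encoder.py.

def build_trie(vocab):
    """Build a nested-dict trie; the "_end" key marks a complete token."""
    trie = {}
    for token in vocab:
        node = trie
        for ch in token:
            node = node.setdefault(ch, {})
        node["_end"] = token
    return trie

def tokenize(text, trie):
    """At each position, emit the longest vocabulary token that matches."""
    tokens, i = [], 0
    while i < len(text):
        node, match, j = trie, None, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_end" in node:
                match = node["_end"]  # longest match seen so far
        if match is None:
            tokens.append(text[i])  # unknown character: fall back to itself
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens

trie = build_trie(["नम", "नमस्ते", "दुनिया"])
print(tokenize("नमस्ते दुनिया", trie))  # ['नमस्ते', ' ', 'दुनिया']
```

Because each lookup walks the trie character by character, tokenization cost scales with text length times the maximum token length (64 in the training stats above), rather than with a vocabulary scan at every position.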