metadata
language: hi
tags:
  - hindi
  - tokenizer
  - bpe
  - subword
  - text-processing
pipeline_tag: text2text-generation
inference: true
license: mit

Hindi Byte Pair Encoding (BPE) Tokenizer

A BPE tokenizer built for Hindi text. At training iteration 4500 it reached a compression ratio of 3.66 with a vocabulary of 4,477 tokens.

Project Overview

This project implements a Byte Pair Encoding (BPE) tokenizer specifically designed for Hindi text. It features:

  • Efficient trie-based tokenization
  • Visualization of training progress
  • Compression ratio optimization
  • Support for large Hindi text datasets
  • Hugging Face compatibility
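To illustrate what trie-based tokenization can look like, here is a minimal sketch of greedy longest-match lookup over a vocabulary trie. It is not taken from byte_pair_encoder.py: the names `TrieNode`, `build_trie`, `tokenize`, and `unk_id`, and the toy vocabulary, are all assumptions made for this example.

```python
class TrieNode:
    """One node per character; token_id is set where a vocabulary entry ends."""

    def __init__(self):
        self.children = {}
        self.token_id = None


def build_trie(vocab):
    """Build a trie from a {token_string: token_id} mapping."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def tokenize(text, root, unk_id=-1):
    """Greedy longest-match tokenization; unmatched characters map to unk_id."""
    ids = []
    i = 0
    while i < len(text):
        node = root
        best_id, best_end = None, i
        j = i
        # Walk the trie as far as the text allows, remembering the last full token.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, j
        if best_id is None:
            ids.append(unk_id)
            i += 1
        else:
            ids.append(best_id)
            i = best_end
    return ids


# Toy vocabulary (hypothetical, not the trained one).
toy_vocab = {"न": 0, "म": 1, "नम": 2, "स्ते": 3, "स": 4}
toy_root = build_trie(toy_vocab)
print(tokenize("नमस्ते", toy_root))  # [2, 3]
```

Each lookup costs time proportional to the length of the matched token rather than to the vocabulary size, which is the usual reason to pick a trie for this step.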

Project Structure

hindi-bpe/
├── data/                    # Dataset directory
│   ├── train/               # Training data
│   └── valid/               # Validation data
├── tokenizer/               # Saved tokenizer files
│   ├── encoder.json         # Encoder state
│   └── vocab_stats.json     # Vocabulary statistics
├── output/                  # Visualization outputs
├── byte_pair_encoder.py     # Core BPE implementation
├── hindi_bpe.py             # Hindi-specific wrapper
├── test_hindi_bpe.py        # Test suite
└── requirements.txt         # Dependencies

Training stats

At iteration 4500:

- Vocabulary size: 4,477
- Data size: 448,754
- Compression ratio: 3.66
- Max token length: 64
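The README does not define how the compression ratio is computed. A common convention, assumed here, is the length of the original text divided by the number of tokens after encoding. Under that assumption, the reported figures imply a token count as sketched below; `compression_ratio` is a hypothetical helper, not the function in hindi_bpe.py.

```python
def compression_ratio(original_length, token_count):
    """Assumed definition: units of input text per output token."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return original_length / token_count


# Figures reported above for iteration 4500.
data_size = 448_754
reported_ratio = 3.66

# Token count implied by the assumed definition (about 122,610).
implied_tokens = round(data_size / reported_ratio)
print(implied_tokens)
print(round(compression_ratio(data_size, implied_tokens), 2))
```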

File Descriptions

  1. byte_pair_encoder.py

    • Core BPE implementation
    • Trie-based tokenization
    • Training statistics tracking
    • Visualization utilities
  2. hindi_bpe.py

    • Hindi-specific tokenizer wrapper
    • Text preprocessing
    • Model saving/loading
    • Compression ratio calculation
  3. app.py

    • Interactive web interface
    • Real-time tokenization
    • Training visualization
    • Model parameter tuning
  4. test_hindi_bpe.py

    • Test suite for tokenizer
    • Performance benchmarks
    • Example usage
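For readers new to BPE, the core training loop that a file like byte_pair_encoder.py implements can be sketched as follows: count adjacent symbol pairs, merge the most frequent pair, and repeat. This is the textbook algorithm under simplifying assumptions (character-level start, no pre-tokenization, no trie), and the function names are invented for this example rather than taken from the repository.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent symbol pairs across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of pair with the joined symbol."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to num_merges merges; returns (merges, segmented words)."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(sequences)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges, sequences


merges, segmented = train_bpe(["abab", "abc"], num_merges=1)
print(merges)     # [('a', 'b')]
print(segmented)  # [['ab', 'ab'], ['ab', 'c']]
```

Each merge adds one entry to the vocabulary, so the number of merges is the main knob for trading vocabulary size against compression ratio.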

Installation

Clone the repository and install the dependencies:

- git clone https://github.com/yourusername/hindi-bpe.git
- cd hindi-bpe
- pip install -r requirements.txt

Download and prepare dataset

- python download_dataset.py

Web Interface

- streamlit run app.py

Tests

- python test_hindi_bpe.py

The test suite includes:

- Training pipeline verification
- Compression ratio validation
- Token count requirements
- Encoding/decoding accuracy
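An encoding/decoding accuracy check usually means a round trip: decoding the encoded text should return the original string. The sketch below shows the idea with a hypothetical `check_round_trip` helper and a toy code-point codec standing in for the real tokenizer, whose interface is not documented here.

```python
def check_round_trip(encode, decode, samples):
    """Return the samples that do not survive encode followed by decode."""
    return [text for text in samples if decode(encode(text)) != text]


# Toy stand-in codec (an assumption for illustration): one id per code point.
def toy_encode(text):
    return [ord(ch) for ch in text]


def toy_decode(ids):
    return "".join(chr(i) for i in ids)


failures = check_round_trip(toy_encode, toy_decode, ["नमस्ते", "भारत"])
print(failures)  # [] when every sample round-trips
```

Passing the tokenizer's own encode and decode callables in place of the toy codec would give the same check against real data.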

Performance Metrics

The tokenizer aims to achieve:
- Vocabulary size < 5000 tokens
- Compression ratio ≥ 3.2
- Fast encoding/decoding
- Memory-efficient operation
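The two numeric targets can be checked mechanically. The helper below is a hypothetical sketch, not part of the repository; it encodes the thresholds stated above as default arguments and applies them to the figures reported for iteration 4500.

```python
def meets_targets(vocab_size, compression_ratio, max_vocab=5000, min_ratio=3.2):
    """True when vocabulary size is under max_vocab and ratio is at least min_ratio."""
    return vocab_size < max_vocab and compression_ratio >= min_ratio


# Reported training stats: vocabulary size 4,477, compression ratio 3.66.
print(meets_targets(4477, 3.66))  # True
```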

Contributing

  1. Fork the repository
  2. Create feature branch
  3. Commit changes
  4. Push to branch
  5. Create Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.