---
title: Multilingual Tokenizer Comparison
emoji: 🔍
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 4.19.2
app_file: app.py
pinned: false
---

# Multilingual Tokenizer Comparison

A web application to compare tokenization between a custom multilingual BPE tokenizer and OpenAI's GPT-4 tokenizer.

## Live Demo

Try it out: Hugging Face Spaces Demo

## Features

- Supports multiple scripts:
  - Latin (English)
  - Devanagari (Hindi)
  - Kannada
- Shows token counts and IDs for both tokenizers
- Interactive web interface
- Example texts for comparison

## Tokenizer Details

### Overview

The custom tokenizer was developed using Byte Pair Encoding (BPE) with a custom regex pattern designed specifically for multilingual text. The development process included:

1. Custom Regex for BPE Tokenization:
   - A specialized regex pattern that handles English, Hindi, and Kannada scripts (a sketch follows this list)
   - Carefully designed to preserve linguistic units in each script
2. Training Corpus Composition:
   - English (60%): from the HuggingFaceFW/fineweb-edu dataset
   - Hindi (20%): from the ai4bharat/sangraha dataset (Devanagari script)
   - Kannada (20%): from the ai4bharat/sangraha dataset (Kannada script)
   - This distribution aligns with token distribution patterns observed in models like GPT-4
3. Vocabulary Details:
   - Total size: 3257 tokens
   - Composition:
     - 256 byte-level tokens
     - 3000 merge operations
     - 1 special `<|endoftext|>` token
   - Achieves an approximately 4.07x compression ratio
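
Below is a minimal sketch of the kind of script-aware split pattern described above. The actual pattern is defined in tokenizer.py and is not reproduced here, so the Unicode ranges and grouping below are illustrative assumptions.

```python
import re

# Hypothetical split pattern: English contractions, Latin words,
# Devanagari (Hindi) runs, Kannada runs, digit runs, other symbols, whitespace.
SPLIT_PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"                         # common English contractions
    r"| ?[A-Za-z]+"                                  # Latin-script words
    r"| ?[\u0900-\u097F]+"                           # Devanagari (Hindi) runs
    r"| ?[\u0C80-\u0CFF]+"                           # Kannada runs
    r"| ?\d+"                                        # digit runs
    r"| ?[^\sA-Za-z\d\u0900-\u097F\u0C80-\u0CFF]+"   # punctuation / symbols
    r"|\s+"                                          # remaining whitespace
)

print(SPLIT_PATTERN.findall("Hello नमस्ते ನಮಸ್ಕಾರ 2024!"))
# ['Hello', ' नमस्ते', ' ನಮಸ್ಕಾರ', ' 2024', '!']
```

Pre-splitting like this keeps BPE merges from crossing script or word boundaries, which is what preserves the per-script linguistic units.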

## Technical Implementation

The tokenizer implementation includes:

- Custom regex patterns for multilingual text segmentation
- BPE training with controlled merge operations (a minimal training sketch follows this list)
- Special token handling
- Efficient encoding/decoding mechanisms
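
To make the training step concrete, here is a toy, self-contained byte-level BPE merge loop. It is independent of the actual code in tokenizer.py: the real tokenizer applies the regex split first and learns 3000 merges on the multilingual corpus, while this sketch uses toy inputs.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))          # start from raw bytes (token ids 0-255)
    merges = {}                               # (left_id, right_id) -> new token id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))    # count adjacent id pairs
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]     # most frequent pair
        new_id = 256 + i                      # new ids start after the 256 byte tokens
        merges[pair] = new_id
        merged, j = [], 0                     # replace every occurrence of the pair
        while j < len(ids):
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
                merged.append(new_id)
                j += 2
            else:
                merged.append(ids[j])
                j += 1
        ids = merged
    return merges

merges = train_bpe("low lower lowest " * 10, num_merges=10)
print(len(merges), "merges learned")
```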

## Installation

```bash
# Clone the repository
git clone https://github.com/MohammedYaseen97/bpe_tok_era.git
cd bpe_tok_era

# Install dependencies
pip install -r requirements.txt

# Run the app locally
python app.py
```

## Project Structure

```
├── app.py            # Gradio web interface
├── tokenizer.py      # Custom tokenizer implementation
├── bpe_tok.model     # Trained tokenizer model
├── requirements.txt  # Project dependencies
└── README.md         # Project documentation
```

## Development Process

The tokenizer development involved several key steps:

1. Dataset Preparation:
   - Careful selection of multilingual datasets
   - Balanced sampling to maintain script representation
   - Text cleaning and preprocessing
2. Tokenizer Training:
   - Custom regex pattern development
   - BPE training with controlled vocabulary growth
   - Optimization for multilingual support
3. Performance Metrics:
   - Compression ratio: 4.07x (measurement sketched after this list)
   - Balanced token distribution across scripts
   - Efficient handling of mixed-script text
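
One common way to read a compression ratio like this is UTF-8 bytes in divided by tokens out. The sketch below measures it that way; the `Tokenizer` class and its `load`/`encode` methods are assumed names, so adapt them to the actual interface in tokenizer.py.

```python
from tokenizer import Tokenizer  # hypothetical class/method names

tok = Tokenizer()
tok.load("bpe_tok.model")                       # trained model shipped with the repo

text = "Hello नमस्ते ನಮಸ್ಕಾರ"                    # small mixed-script sample
ids = tok.encode(text)

ratio = len(text.encode("utf-8")) / len(ids)    # UTF-8 bytes per token
print(f"{len(ids)} tokens, compression ratio ≈ {ratio:.2f}x")
```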

## Usage Examples

The tokenizer effectively handles various text combinations:

- Pure English text
- Pure Hindi text
- Pure Kannada text
- Mixed-script text
- Special tokens and control characters
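
For a quick programmatic version of the comparison the web app performs, GPT-4's tokenizer can be loaded through the tiktoken package (the cl100k_base encoding); the custom tokenizer interface below is the same assumed one as in the sketch above.

```python
import tiktoken
from tokenizer import Tokenizer  # hypothetical interface, as above

gpt4_enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4

custom = Tokenizer()
custom.load("bpe_tok.model")

text = "The weather is nice. मौसम अच्छा है। ಹವಾಮಾನ ಚೆನ್ನಾಗಿದೆ."
print("custom tokens:", len(custom.encode(text)))
print("gpt-4 tokens: ", len(gpt4_enc.encode(text)))
```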

## License

MIT License

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request