Model Card for Custom WordPiece Tokenizer with Normalized Frequency Scoring

Model Details

Model Description

This project provides a custom WordPiece tokenizer for natural language processing tasks. The tokenizer was built from scratch: it implements the WordPiece algorithm and adds a normalized frequency scoring system that guides token selection during vocabulary generation.

  • Developed by: Anton Krasniuk, antoshka1608
  • Model type: Custom WordPiece Tokenizer
  • Language(s): English
  • Finetuned from model: N/A

Uses

Direct Use

This tokenizer is intended for use in:

  • NLP tasks requiring tokenization of text for language models.
  • Projects that require precise token representation with normalized scoring.
  • Preprocessing pipelines for large-scale language model training.

Downstream Use

This tokenizer is ideal for:

  • Custom-trained models where vocabulary optimization and token frequency distribution are critical.
  • Fine-tuning tasks where better tokenization enhances training efficiency and model performance.

Out-of-Scope Use

  • Tokenization of non-English or low-resource languages (requires additional training).
  • Domains where subword-level tokenization might not be suitable, such as phoneme-based tasks.

Bias, Risks, and Limitations

Recommendations

This tokenizer should be used with datasets where token frequency distribution is meaningful. It is essential to be mindful of data sparsity or skewed distributions, which could impact the normalization process.


How to Get Started with the Tokenizer

Here’s how to initialize and use the tokenizer:

from custom_wordpiece_tokenizer import BaseTokenizer

# Load pretrained tokenizer
tokenizer = BaseTokenizer.from_pretrained("antoshka1608/wordpiece-tokenizer-v1")

# Tokenize your text
tokens = tokenizer.tokenize("Sample input text")
print(tokens)

Training Details

Training Procedure

The tokenizer was trained using the following approach:

  • Implementation of the WordPiece algorithm to build the vocabulary.
  • Introduction of a normalized frequency scoring system for token selection, balancing token frequency and subword importance.
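The two steps above can be sketched as a small training loop. This is a minimal illustration, not the author's implementation: the function names (`split_word`, `count_stats`, `merge_pair`, `train`), the `##` continuation prefix, the toy word counts, and the choice to recompute the maximum token frequency at every step are all assumptions.

```python
from collections import Counter

def split_word(word):
    # Assumed convention: non-initial pieces carry a "##" prefix.
    return [word[0]] + ["##" + ch for ch in word[1:]]

def count_stats(splits, word_freq):
    # Count how often each piece and each adjacent pair occurs in the corpus.
    token_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freq.items():
        pieces = splits[word]
        for piece in pieces:
            token_freq[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_freq[(a, b)] += freq
    return token_freq, pair_freq

def merge_pair(pieces, a, b):
    # Replace every adjacent (a, b) with the merged piece.
    merged = a + (b[2:] if b.startswith("##") else b)
    out, i = [], 0
    while i < len(pieces):
        if i + 1 < len(pieces) and pieces[i] == a and pieces[i + 1] == b:
            out.append(merged)
            i += 2
        else:
            out.append(pieces[i])
            i += 1
    return out

def train(word_freq, vocab_size):
    splits = {w: split_word(w) for w in word_freq}
    vocab = {p for pieces in splits.values() for p in pieces}
    while len(vocab) < vocab_size:
        token_freq, pair_freq = count_stats(splits, word_freq)
        if not pair_freq:
            break  # every word is a single piece; nothing left to merge
        max_freq = max(token_freq.values())  # assumption: recomputed each step

        def score(pair):
            a, b = pair
            return pair_freq[pair] / (
                (token_freq[a] / max_freq) * (token_freq[b] / max_freq)
            )

        a, b = max(pair_freq, key=score)
        splits = {w: merge_pair(ps, a, b) for w, ps in splits.items()}
        vocab.add(a + b[2:])  # b is never word-initial, so it starts with "##"
    return vocab

# Toy word counts, made up for illustration.
toy_word_freq = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab = train(toy_word_freq, vocab_size=10)
print(sorted(vocab))
```

On this toy corpus the first merge joins `##g` and `##s`, because that pair is frequent relative to its two parts.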

Training Hyperparameters

  • Vocabulary Size: 10,000
  • Normalization Frequency Update: Scaled to balance token frequency and rarity.
  • Scoring Metrics: Weighted frequency scores with normalization.

Evaluation

Metrics

The tokenizer was evaluated using:

  • Tokenization accuracy: Measuring the alignment between the tokenized output and the expected vocabulary coverage.
  • Compression ratio: Evaluating the efficiency of text compression for language model inputs.
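The card does not define the compression ratio precisely. One common reading is the average number of characters per token, which the sketch below computes. The function name `compression_ratio` and the whitespace tokenizer used as a stand-in are assumptions for illustration; any callable that maps a string to a list of tokens could be passed instead.

```python
def compression_ratio(texts, tokenize):
    """Average characters per token over a collection of texts (assumed definition)."""
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(tokenize(text)) for text in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens

# Stand-in tokenizer: plain whitespace splitting.
sample_texts = ["hello world", "hug pug"]
ratio = compression_ratio(sample_texts, str.split)
print(ratio)
```

Under this definition, a higher value means fewer tokens are needed for the same text.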

Results

  • Achieved a higher compression ratio than standard WordPiece implementations.
  • Improved tokenization alignment for datasets with highly imbalanced token frequencies.

Model Examination

Normalization Formula

$$\text{Score}(\text{merge}) = \frac{\text{frequency}(\text{merge})}{\dfrac{\text{frequency}(\text{token\_A})}{\text{max\_frequency}} \cdot \dfrac{\text{frequency}(\text{token\_B})}{\text{max\_frequency}}}$$

This adjustment ensures rare tokens are not overly penalized while maintaining proportional weight for high-frequency tokens.
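As a worked example, the formula can be written as a small function. This is a sketch only: the name `normalized_merge_score` and the toy counts are hypothetical, and `max_frequency` is assumed to be the largest single-token frequency in the current vocabulary.

```python
def normalized_merge_score(pair_freq, freq_a, freq_b, max_freq):
    """frequency(merge) / ((frequency(A) / max_frequency) * (frequency(B) / max_frequency))."""
    if min(freq_a, freq_b, max_freq) <= 0:
        raise ValueError("frequencies must be positive")
    return pair_freq / ((freq_a / max_freq) * (freq_b / max_freq))

# Toy counts, made up for illustration.
token_freq = {"h": 10, "##u": 20, "##g": 5}
max_freq = max(token_freq.values())  # assumed meaning of max_frequency

# Pair ("h", "##u") seen 4 times: 4 / ((10 / 20) * (20 / 20)) = 8.0
score = normalized_merge_score(4, token_freq["h"], token_freq["##u"], max_freq)
print(score)
```

For a fixed `max_frequency`, this value equals the usual WordPiece score, frequency(merge) / (frequency(A) · frequency(B)), multiplied by `max_frequency` squared. Any change in merge order would therefore depend on how `max_frequency` is updated during training, which the card does not specify.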


Technical Specifications

Model Architecture and Objective

  • Implements WordPiece tokenization with an added frequency normalization step.
  • Supports special tokens for various NLP tasks, including <s>, </s>, <pad>, <mask>.
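To illustrate how a trained vocabulary and the special tokens might be used at inference time, here is a sketch of the greedy longest-match-first lookup commonly associated with WordPiece. It is not taken from the repository: the function names, the `##` continuation prefix, the toy vocabulary, and the `<unk>` fallback token (which is not among the special tokens listed above) are all assumptions.

```python
def wordpiece_encode(word, vocab, unk="<unk>"):
    # Greedy longest-match-first: take the longest vocabulary piece at each position.
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # assumed fallback when no piece matches
        pieces.append(match)
        start = end
    return pieces

def encode(text, vocab, bos="<s>", eos="</s>"):
    # Wrap the sequence in the start and end tokens listed in the card.
    tokens = [bos]
    for word in text.split():
        tokens.extend(wordpiece_encode(word, vocab))
    tokens.append(eos)
    return tokens

# Toy vocabulary, made up for illustration.
toy_vocab = {"hug", "##s", "p", "##u", "##g"}
encoded = encode("hugs pug", toy_vocab)
print(encoded)
```

Words are split on whitespace here for simplicity. A real pipeline would usually apply its own pre-tokenization and normalization first.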

Model Card Authors
