Model Card for Custom WordPiece Tokenizer with Normalized Frequency Scoring
Model Details
Model Description
This project introduces a custom WordPiece tokenizer for natural language processing tasks. The tokenizer was built from scratch: it implements the WordPiece algorithm and adds a normalized frequency scoring system intended to improve token selection and vocabulary generation.
- Developed by: Anton Krasniuk, antoshka1608
- Model type: Custom WordPiece Tokenizer
- Language(s): English
- Finetuned from model: N/A
Uses
Direct Use
This tokenizer is intended for use in:
- NLP tasks requiring tokenization of text for language models.
- Projects that require precise token representation with normalized scoring.
- Preprocessing pipelines for large-scale language model training.
Downstream Use
This tokenizer is ideal for:
- Custom-trained models where vocabulary optimization and token frequency distribution are critical.
- Fine-tuning tasks where better tokenization enhances training efficiency and model performance.
Out-of-Scope Use
- Tokenization of non-English or low-resource languages (requires additional training).
- Domains where subword-level tokenization might not be suitable, such as phoneme-based tasks.
Bias, Risks, and Limitations
Recommendations
Use this tokenizer with datasets whose token frequency distribution is meaningful. Sparse data or heavily skewed distributions can distort the normalization step, so check the corpus statistics before training or applying it.
How to Get Started with the Tokenizer
Here’s how to initialize and use the tokenizer:
from custom_wordpiece_tokenizer import BaseTokenizer
# Load pretrained tokenizer
tokenizer = BaseTokenizer.from_pretrained("antoshka1608/wordpiece-tokenizer-v1")
# Tokenize your text
tokens = tokenizer.tokenize("Sample input text")
print(tokens)
Training Details
Training Procedure
The tokenizer was trained using the following approach:
- Implementation of the WordPiece algorithm to build the vocabulary.
- Introduction of a normalized frequency scoring system for token selection, balancing token frequency and subword importance.
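For orientation, the sketch below shows the pair score commonly described for standard WordPiece training: the frequency of a pair divided by the product of the frequencies of its two parts. It is not taken from this project's code; the function names, the "##" continuation prefix, and the toy word counts are assumptions made for illustration, and the card's own normalized frequency scoring may differ.

```python
from collections import Counter

def split_word(word):
    # Assumed convention: the first character stands alone and
    # later characters carry the "##" continuation prefix.
    return [word[0]] + ["##" + ch for ch in word[1:]]

def pair_scores(word_freqs):
    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second))."""
    unit_freq = Counter()
    pair_freq = Counter()
    for word, freq in word_freqs.items():
        units = split_word(word)
        for unit in units:
            unit_freq[unit] += freq
        for first, second in zip(units, units[1:]):
            pair_freq[(first, second)] += freq
    return {
        pair: freq / (unit_freq[pair[0]] * unit_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }

# Hypothetical toy corpus: word -> count
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores = pair_scores(word_freqs)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Dividing by the frequencies of the parts is what keeps very common subwords from dominating every merge: in this toy corpus the pair ("##g", "##s") wins even though ("##u", "##g") occurs more often.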
Training Hyperparameters
- Vocabulary Size: 10,000
- Normalization Frequency Update: Scaled to balance token frequency and rarity.
- Scoring Metrics: Weighted frequency scores with normalization.
Evaluation
Metrics
The tokenizer was evaluated using:
- Tokenization accuracy: Measuring the alignment between the tokenized output and the expected vocabulary coverage.
- Compression ratio: Evaluating the efficiency of text compression for language model inputs.
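The card does not define how the compression ratio is computed. One common reading is the average number of characters covered by each token, and the sketch below follows that assumption. The `compression_ratio` helper and the whitespace tokenizer used as a stand-in are hypothetical; any callable that maps a string to a list of tokens could be passed instead.

```python
def compression_ratio(texts, tokenize):
    """Average number of characters represented by each token (assumed definition)."""
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(tokenize(text)) for text in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens

# Stand-in tokenizer: plain whitespace splitting, for illustration only.
corpus = ["hello world", "tokenizers split text"]
ratio = compression_ratio(corpus, str.split)
print(ratio)
```

Under this definition a higher value means fewer tokens are needed for the same text, which is the direction the comparison between tokenizers would be read in.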
Results
- Achieved a higher compression ratio than standard WordPiece implementations.
- Improved tokenization alignment for datasets with highly imbalanced token frequencies.
Model Examination
Normalization Formula
The normalization adjustment ensures that rare tokens are not overly penalized while high-frequency tokens keep a proportional weight.
Technical Specifications
Model Architecture and Objective
- Implements WordPiece tokenization with an added frequency normalization step.
- Supports special tokens for various NLP tasks, including <s>, </s>, <pad>, and <mask>.
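As a rough illustration of how a WordPiece vocabulary is typically applied at inference time, the sketch below uses greedy longest-match-first segmentation. It is a minimal sketch, not this project's implementation: the `wordpiece_tokenize` function, the "##" continuation prefix, the toy vocabulary, and the `<unk>` fallback token are all assumptions (the card lists only <s>, </s>, <pad>, and <mask>).

```python
def wordpiece_tokenize(word, vocab, unk_token="<unk>"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # assumed continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
vocab = {"hug", "##s", "b", "##u", "##g", "##gs"}
print(wordpiece_tokenize("hugs", vocab))  # ['hug', '##s']
print(wordpiece_tokenize("bugs", vocab))  # ['b', '##u', '##gs']
print(wordpiece_tokenize("mug", vocab))   # ['<unk>']
```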
Model Card Authors
- Author: Anton Krasniuk
- Contact: [email protected]