Saiteja
/

telugu-bpe

@@ -1,38 +1,105 @@
 ---
-language: te
-tags:
-- telugu
-- tokenizer
-- bpe
-license: mit
----
-# Telugu BPE Tokenizer
-A Byte-Pair Encoding (BPE) tokenizer trained on Telugu text data from Wikipedia.
-## Model Description
-This tokenizer was trained on Telugu text data collected from Wikipedia articles. It uses Byte-Pair Encoding (BPE) to create subword tokens.
-## Stats
-- Vocabulary Size: 50000 tokens
-- Compression Ratio: 3.43
 ## Usage
 ```python
 from tokenizers import Tokenizer
 # Load the tokenizer
 tokenizer = Tokenizer.from_file("tokenizer.json")
-# Tokenize text
-text = "నమస్కారం"
 encoding = tokenizer.encode(text)
-print(encoding.tokens)
 ```
-## Training Data
-The tokenizer was trained on Telugu text data collected from Wikipedia articles. The data includes a diverse range of topics and writing styles.

 ---
+# Telugu Tokenizer
+A Unigram tokenizer specifically trained for the Telugu language using a large corpus of Telugu text from Wikipedia and news sources. This tokenizer is designed to efficiently handle Telugu text while maintaining high compression ratios.
+## Key Features
+### Tokenizer Statistics
+- **Vocabulary Size**: 50000 tokens (✓ Exceeds requirement of 5000+)
+- **Compression Ratio**: 6.77 (✓ Meets requirement of ≥3.0)
+- **Average Token Length**: 6.26 characters
+- **Training Data**: 2,500+ Telugu articles
+- **Minimum Text Length**: 500 characters per article
+### Model Configuration
+- **Architecture**: Unigram Language Model
+- **Max Piece Length**: 128
+- **Sub-iterations**: 20
+- **Initial Vocabulary**: 50000 tokens
+- **Auto-scaling**: Up to 500,000 tokens if needed
+### Special Tokens
+- `<s>`: Start of text token
+- `</s>`: End of text token
+- `<unk>`: Unknown token
+- `<pad>`: Padding token
+- `<mask>`: Mask token (for potential MLM tasks)
+## Dataset Details
+- **Sources**:
+  - Telugu Wikipedia articles
+  - Major Telugu news websites
+  - Combined and cleaned text corpus
+- **Content**: Diverse topics including literature, culture, history, and general knowledge
+- **Preprocessing**:
+  - Removed references and citations
+  - Normalized whitespace
+  - Filtered short articles
+  - Cleaned special characters
+  - Combined short texts for better context
 ## Usage
+### Installation
+```bash
+pip install tokenizers
+```
+### Basic Usage
 ```python
 from tokenizers import Tokenizer
 # Load the tokenizer
 tokenizer = Tokenizer.from_file("tokenizer.json")
+# Encode text
+text = "నమస్కారం"  # Hello
 encoding = tokenizer.encode(text)
+# Get tokens
+print("Tokens:", encoding.tokens)
+print("Token IDs:", encoding.ids)
 ```
+### Example Outputs
+```python
+# Input: "తెలుగు భాష చాలా అందమైనది"
+# Output tokens: ['తెలుగు', ' భాష', ' చాలా', ' అంద', 'మైన', 'ది']
+```
+## Technical Details
+### Tokenizer Configuration
+- **Model**: Unigram Language Model (SentencePiece-style)
+- **Pre-tokenization**: ByteLevel + Character-level splitting
+- **Decoder**: ByteLevel
+- **Post-processor**: ByteLevel with trimmed offsets
+### Performance Metrics
+1. **Compression Ratio**: 6.77
+   - Calculated as: total_chars / total_tokens
+   - Higher ratio indicates better compression
+   - Median ratio: 7.05
+2. **Vocabulary Coverage**: 50000 unique tokens
+   - Includes special tokens
+   - Optimized for Telugu language patterns
+   - Auto-scales vocabulary size for better compression
+## Examples
+Check `examples.json` for more tokenization examples with different types of Telugu text, including:
+- Short phrases
+- Complete sentences
+- Long paragraphs
+- Various writing styles
+## Training Process
+The tokenizer was trained using the following steps:
+1. Collected 2,500+ Telugu articles from multiple sources
+2. Cleaned and preprocessed the text
+3. Combined short texts to create better context
+4. Trained Unigram model with initial vocab size of 50,000
+5. Auto-scaled vocabulary if needed for better compression
+6. Validated against requirements