Saiteja committed on
Commit 7d13db4 · verified · 1 Parent(s): 7878358

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +86 -19

README.md CHANGED
@@ -1,38 +1,105 @@
  ---
- language: te
- tags:
- - telugu
- - tokenizer
- - bpe
- license: mit
- ---

- # Telugu BPE Tokenizer

- A Byte-Pair Encoding (BPE) tokenizer trained on Telugu text data from Wikipedia.

- ## Model Description

- This tokenizer was trained on Telugu text data collected from Wikipedia articles. It uses Byte-Pair Encoding (BPE) to create subword tokens.

- ## Stats
- - Vocabulary Size: 50000 tokens
- - Compression Ratio: 3.43

  ## Usage

  ```python
  from tokenizers import Tokenizer

  # Load the tokenizer
  tokenizer = Tokenizer.from_file("tokenizer.json")

- # Tokenize text
- text = "నమస్కారం"
  encoding = tokenizer.encode(text)
- print(encoding.tokens)
  ```

- ## Training Data

- The tokenizer was trained on Telugu text data collected from Wikipedia articles. The data includes a diverse range of topics and writing styles.
  ---
+ # Telugu Tokenizer
+
+ A Unigram tokenizer trained for the Telugu language on a large corpus of text from Wikipedia and news sources. It is designed to handle Telugu text efficiently while maintaining a high compression ratio.
+
+ ## Key Features
+
+ ### Tokenizer Statistics
+ - **Vocabulary Size**: 50,000 tokens (✓ Exceeds requirement of 5,000+)
+ - **Compression Ratio**: 6.77 (✓ Meets requirement of ≥3.0)
+ - **Average Token Length**: 6.26 characters
+ - **Training Data**: 2,500+ Telugu articles
+ - **Minimum Text Length**: 500 characters per article
+
+ ### Model Configuration
+ - **Architecture**: Unigram Language Model
+ - **Max Piece Length**: 128
+ - **Sub-iterations**: 20
+ - **Initial Vocabulary**: 50,000 tokens
+ - **Auto-scaling**: Up to 500,000 tokens if needed
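+
+ The training script itself is not part of this upload, but these settings map naturally onto `tokenizers.trainers.UnigramTrainer`. A minimal sketch using the values listed above (illustrative, not the original code):
+
+ ```python
+ from tokenizers import Tokenizer, models, trainers
+
+ # Empty Unigram model; the trainer fills in the vocabulary
+ tokenizer = Tokenizer(models.Unigram())
+
+ trainer = trainers.UnigramTrainer(
+     vocab_size=50_000,        # initial vocabulary target
+     max_piece_length=128,     # longest subword piece considered
+     n_sub_iterations=20,      # EM sub-iterations per pruning round
+     unk_token="<unk>",
+     special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
+ )
+ ```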
+
+ ### Special Tokens
+ - `<s>`: Start-of-text token
+ - `</s>`: End-of-text token
+ - `<unk>`: Unknown token
+ - `<pad>`: Padding token
+ - `<mask>`: Mask token (for potential MLM tasks)
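+
+ If a downstream model needs the IDs of these tokens (for padding or masking), they can be looked up from the saved tokenizer; a short example:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ # Print the vocabulary ID assigned to each special token
+ for token in ["<s>", "</s>", "<unk>", "<pad>", "<mask>"]:
+     print(token, "->", tokenizer.token_to_id(token))
+ ```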
+
+ ## Dataset Details
+ - **Sources**:
+   - Telugu Wikipedia articles
+   - Major Telugu news websites
+   - Combined and cleaned text corpus
+ - **Content**: Diverse topics including literature, culture, history, and general knowledge
+ - **Preprocessing**:
+   - Removed references and citations
+   - Normalized whitespace
+   - Filtered short articles
+   - Cleaned special characters
+   - Combined short texts for better context
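+
+ The cleaning code is not included here, but a minimal sketch of the preprocessing described above might look like the following (the regexes, the 500-character cutoff from the stats section, and the combine target are assumptions for illustration):
+
+ ```python
+ import re
+
+ MIN_CHARS = 500  # minimum article length, matching the stat above
+
+ def clean_article(text: str) -> str:
+     """Drop citation markers and normalize whitespace (illustrative)."""
+     text = re.sub(r"\[\d+\]", "", text)       # remove references like [12]
+     text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
+     return text
+
+ def build_corpus(articles: list[str], target_len: int = 2000) -> list[str]:
+     """Filter short articles and merge short texts for better context."""
+     cleaned = [clean_article(a) for a in articles if len(a) >= MIN_CHARS]
+     merged, buffer = [], ""
+     for article in cleaned:
+         buffer = f"{buffer} {article}".strip()
+         if len(buffer) >= target_len:
+             merged.append(buffer)
+             buffer = ""
+     if buffer:
+         merged.append(buffer)
+     return merged
+ ```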

  ## Usage

+ ### Installation
+ ```bash
+ pip install tokenizers
+ ```
+
+ ### Basic Usage
  ```python
  from tokenizers import Tokenizer

  # Load the tokenizer
  tokenizer = Tokenizer.from_file("tokenizer.json")

+ # Encode text
+ text = "నమస్కారం"  # Hello
  encoding = tokenizer.encode(text)
+
+ # Get tokens
+ print("Tokens:", encoding.tokens)
+ print("Token IDs:", encoding.ids)
  ```

+ ### Example Outputs
+ ```python
+ # Input: "తెలుగు భాష చాలా అందమైనది"
+ # Output tokens: ['తెలుగు', ' భాష', ' చాలా', ' అంద', 'మైన', 'ది']
+ ```
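+
+ Since the pipeline uses a ByteLevel decoder, token IDs can be turned back into text; a quick round-trip check (output not captured here, but the decoded string should match the input):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ text = "తెలుగు భాష చాలా అందమైనది"
+ encoding = tokenizer.encode(text)
+
+ # Decode the IDs back to a string
+ print(tokenizer.decode(encoding.ids))
+ ```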
+
+ ## Technical Details
+
+ ### Tokenizer Configuration
+ - **Model**: Unigram Language Model (SentencePiece-style)
+ - **Pre-tokenization**: ByteLevel + character-level splitting
+ - **Decoder**: ByteLevel
+ - **Post-processor**: ByteLevel with trimmed offsets
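+
+ In `tokenizers` terms, a pipeline with these components is assembled roughly as below. The saved `tokenizer.json` already encodes the full configuration, so this is only a reference sketch; the extra character-level splitting step and the exact ByteLevel options are not reproduced here and are assumptions:
+
+ ```python
+ from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors
+
+ tokenizer = Tokenizer(models.Unigram())
+
+ # ByteLevel pre-tokenization, decoding, and post-processing with trimmed offsets
+ tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
+ tokenizer.decoder = decoders.ByteLevel()
+ tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
+ ```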
+
+ ### Performance Metrics
+ 1. **Compression Ratio**: 6.77
+    - Calculated as: total_chars / total_tokens
+    - Higher ratio indicates better compression
+    - Median ratio: 7.05
+ 2. **Vocabulary Coverage**: 50,000 unique tokens
+    - Includes special tokens
+    - Optimized for Telugu language patterns
+    - Auto-scales vocabulary size for better compression
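+
+ The ratio can be recomputed on any text you care about with a few lines (the evaluation corpus behind the 6.77 figure is not bundled, so your numbers will differ depending on the input):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ def compression_ratio(texts: list[str]) -> float:
+     """total_chars / total_tokens over a list of texts."""
+     total_chars = sum(len(t) for t in texts)
+     total_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
+     return total_chars / total_tokens
+
+ print(compression_ratio(["తెలుగు భాష చాలా అందమైనది"]))
+ ```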
+
+ ## Examples
+ Check `examples.json` for more tokenization examples with different types of Telugu text, including:
+ - Short phrases
+ - Complete sentences
+ - Long paragraphs
+ - Various writing styles
+
+ ## Training Process
+ The tokenizer was trained using the following steps:
+ 1. Collected 2,500+ Telugu articles from multiple sources
+ 2. Cleaned and preprocessed the text
+ 3. Combined short texts to create better context
+ 4. Trained Unigram model with initial vocab size of 50,000
+ 5. Auto-scaled vocabulary if needed for better compression
+ 6. Validated against requirements
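+
+ Putting the earlier configuration together, the training step itself reduces to a single `train` call over the cleaned corpus files followed by a save. A condensed sketch (the corpus file name is a placeholder; the real script and the auto-scaling loop are not included in this repo):
+
+ ```python
+ from tokenizers import Tokenizer, models, trainers
+
+ tokenizer = Tokenizer(models.Unigram())
+ trainer = trainers.UnigramTrainer(
+     vocab_size=50_000,
+     unk_token="<unk>",
+     special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
+ )
+
+ # Train on the preprocessed corpus and save the result
+ tokenizer.train(files=["telugu_corpus.txt"], trainer=trainer)
+ tokenizer.save("tokenizer.json")
+ ```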
105