Update README.md
README.md
CHANGED
@@ -40,37 +40,35 @@ The tokenizer has been tested on multiple text categories:
|
|
40 |
|
41 |
#### Test Case 1: Basic sentence
|
42 |
**Original text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
|
43 |
-
**Encoded tokens:** ['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']
|
44 |
-
**Token IDs:** [0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]
|
45 |
**Decoded text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
|
46 |
**Analysis:**
|
47 |
-
- Number of tokens
|
48 |
-
- Average token length
|
49 |
-
- Reconstruction
|
50 |
|
51 |
#### Test Case 2: Complex sentence
|
52 |
**Original text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
|
53 |
-
**Encoded tokens:** ['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']
|
54 |
-
**Token IDs:** [0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]
|
55 |
**Decoded text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
|
56 |
**Analysis:**
|
57 |
-
- Number of tokens
|
58 |
-
- Average token length
|
59 |
-
- Reconstruction
|
60 |
|
61 |
### Category: Mixed Language
|
62 |
|
63 |
#### Test Case 1: Kannada with English
|
64 |
**Original text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
|
65 |
-
**Encoded tokens:** ['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']
|
66 |
-
**Token IDs:** [0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]
|
67 |
**Decoded text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
|
68 |
**Analysis:**
|
69 |
-
- Number of tokens
|
70 |
-
- Average token length
|
71 |
-
- Reconstruction
|
72 |
-
|
73 |
-
(Additional test cases can be added following the same format)
|
74 |
|
75 |
## Repository Structure
|
76 |
The repository consists of tokenizer files, configuration files, and documentation:
|
|
|
40 |
|
41 |
#### Test Case 1: Basic sentence
|
42 |
**Original text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
|
43 |
+
**Encoded tokens:** `['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']`
|
44 |
+
**Token IDs:** `[0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]`
|
45 |
**Decoded text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
|
46 |
**Analysis:**
|
47 |
+
- **Number of tokens:** 14
|
48 |
+
- **Average token length:** 1.29 characters
|
49 |
+
- **Reconstruction:** Perfect
|
50 |
|
51 |
#### Test Case 2: Complex sentence
|
52 |
**Original text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
|
53 |
+
**Encoded tokens:** `['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']`
|
54 |
+
**Token IDs:** `[0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]`
|
55 |
**Decoded text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
|
56 |
**Analysis:**
|
57 |
+
- **Number of tokens:** 26
|
58 |
+
- **Average token length:** 1.27 characters
|
59 |
+
- **Reconstruction:** Perfect
|
60 |
|
61 |
### Category: Mixed Language
|
62 |
|
63 |
#### Test Case 1: Kannada with English
|
64 |
**Original text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
|
65 |
+
**Encoded tokens:** `['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']`
|
66 |
+
**Token IDs:** `[0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]`
|
67 |
**Decoded text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
|
68 |
**Analysis:**
|
69 |
+
- **Number of tokens:** 36
|
70 |
+
- **Average token length:** 1.14 characters
|
71 |
+
- **Reconstruction:** Perfect
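
The analysis figures above can be reproduced with a few lines of Python. This is a minimal sketch, not the script used to generate the README: the helper name `token_stats` is hypothetical, and the formula is an assumption inferred from the reported numbers, which are consistent with dividing the number of Unicode code points in the original text by the total token count, special tokens `<s>` and `</s>` included.

```python
def token_stats(text, token_ids):
    """Return (token count, average characters per token) for one test case.

    Assumption: "characters" means Unicode code points of the original text,
    and the token count includes the <s> and </s> special tokens.
    """
    num_tokens = len(token_ids)
    avg_len = len(text) / num_tokens  # len() on a str counts code points
    return num_tokens, round(avg_len, 2)


# Test Case 1 (Basic sentence), values copied from the section above.
text = "ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ"
token_ids = [0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]

print(token_stats(text, token_ids))  # (14, 1.29)
```

Because Kannada vowel signs and the virama are separate code points, a single visible syllable can count as two or three characters under this definition, so the average is not a per-syllable figure.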
|
|
|
|
|
72 |
|
73 |
## Repository Structure
|
74 |
The repository consists of tokenizer files, configuration files, and documentation:
|