ruthuvikas1998 committed on
Commit 01d3c74 · verified · 1 Parent(s): ec3dcf7

Update README.md

Files changed (1)
  1. README.md +15 -17
README.md CHANGED
@@ -40,37 +40,35 @@ The tokenizer has been tested on multiple text categories:
40
 
41
  #### Test Case 1: Basic sentence
42
  **Original text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
43
- **Encoded tokens:** ['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']
44
- **Token IDs:** [0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]
45
  **Decoded text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
46
  **Analysis:**
47
- - Number of tokens: 14
48
- - Average token length: 1.29 characters
49
- - Reconstruction: Perfect
50
 
51
  #### Test Case 2: Complex sentence
52
  **Original text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
53
- **Encoded tokens:** ['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']
54
- **Token IDs:** [0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]
55
  **Decoded text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
56
  **Analysis:**
57
- - Number of tokens: 26
58
- - Average token length: 1.27 characters
59
- - Reconstruction: Perfect
60
 
61
  ### Category: Mixed Language
62
 
63
  #### Test Case 1: Kannada with English
64
  **Original text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
65
- **Encoded tokens:** ['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']
66
- **Token IDs:** [0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]
67
  **Decoded text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
68
  **Analysis:**
69
- - Number of tokens: 36
70
- - Average token length: 1.14 characters
71
- - Reconstruction: Perfect
72
-
73
- (Additional test cases can be added following the same format)
74
 
75
  ## Repository Structure
76
  The repository consists of tokenizer files, configuration files, and documentation:
 
40
 
41
  #### Test Case 1: Basic sentence
42
  **Original text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
43
+ **Encoded tokens:** `['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']`
44
+ **Token IDs:** `[0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]`
45
  **Decoded text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
46
  **Analysis:**
47
+ - **Number of tokens:** 14
48
+ - **Average token length:** 1.29 characters
49
+ - **Reconstruction:** Perfect
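
Some encoded tokens above (for example `'à³į'` or `'Ġà²Ń'`) look garbled because byte-level BPE tokenizers typically display each UTF-8 byte as a stand-in printable character. The sketch below assumes this tokenizer uses the GPT-2 style byte-to-character table, which the `Ġ` space marker suggests but this README does not confirm. The helper names `bytes_to_unicode` and `decode_token` are illustrative and not part of the repository.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping each of the 256 byte values to one printable character."""
    byte_values = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    code_points = byte_values[:]
    shift = 0
    for b in range(256):
        if b not in byte_values:
            # Non-printable bytes are shifted to code points 256 and above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a byte-level token string back into the text it stands for."""
    return bytes(CHAR_TO_BYTE[ch] for ch in token).decode("utf-8")


# Escapes are used so the exact code points are unambiguous.
virama = decode_token("\u00e0\u00b3\u012f")          # displayed as 'à³į'
space_bha = decode_token("\u0120\u00e0\u00b2\u0143")  # displayed as 'Ġà²Ń'
print(repr(virama), repr(space_bha))
```

Under that assumption, `'à³į'` is the three UTF-8 bytes of the Kannada virama (U+0CCD) and `'Ġà²Ń'` is a space followed by ಭ (U+0CAD), so the token lists are consistent with the decoded text.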
50
 
51
  #### Test Case 2: Complex sentence
52
  **Original text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
53
+ **Encoded tokens:** `['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']`
54
+ **Token IDs:** `[0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]`
55
  **Decoded text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
56
  **Analysis:**
57
+ - **Number of tokens:** 26
58
+ - **Average token length:** 1.27 characters
59
+ - **Reconstruction:** Perfect
60
 
61
  ### Category: Mixed Language
62
 
63
  #### Test Case 1: Kannada with English
64
  **Original text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
65
+ **Encoded tokens:** `['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']`
66
+ **Token IDs:** `[0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]`
67
  **Decoded text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
68
  **Analysis:**
69
+ - **Number of tokens:** 36
70
+ - **Average token length:** 1.14 characters
71
+ - **Reconstruction:** Perfect
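
The README does not say how "Average token length" is computed. The sketch below assumes it is the number of Unicode code points in the original text (spaces included) divided by the token count, with `<s>` and `</s>` counted as tokens. The function name `average_token_length` is hypothetical, and the email address in the third sample is the one spelled out by the encoded tokens.

```python
# (original text, number of tokens reported in the analysis)
samples = [
    ("ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ", 14),
    ("ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ", 26),
    ("ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ", 36),
]


def average_token_length(text: str, num_tokens: int) -> float:
    """Code points per token; len() counts code points, not grapheme clusters."""
    return len(text) / num_tokens


averages = [round(average_token_length(text, n), 2) for text, n in samples]
print(averages)
```

With these inputs the ratios are 18/14, 33/26 and 41/36, which match the reported 1.29, 1.27 and 1.14. Because `len()` counts code points, a conjunct such as ಸ್ಕೃ contributes several "characters" even though it renders as a single cluster.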
 
 
72
 
73
  ## Repository Structure
74
  The repository consists of tokenizer files, configuration files, and documentation: