utkarsh2299 committed on
Commit 2c8dc05 · verified · 1 Parent(s): bc761c1

Upload 97 files

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. Unified_parser/.vscode/tasks.json +28 -0
  2. Unified_parser/LICENSE +21 -0
  3. Unified_parser/README.md +34 -0
  4. Unified_parser/__pycache__/get_phone_mapped_python.cpython-311.pyc +0 -0
  5. Unified_parser/__pycache__/get_phone_mapped_python.cpython-37.pyc +0 -0
  6. Unified_parser/__pycache__/globals.cpython-310.pyc +0 -0
  7. Unified_parser/__pycache__/globals.cpython-311.pyc +0 -0
  8. Unified_parser/__pycache__/globals.cpython-37.pyc +0 -0
  9. Unified_parser/__pycache__/globals.cpython-38.pyc +0 -0
  10. Unified_parser/__pycache__/helpers.cpython-310.pyc +0 -0
  11. Unified_parser/__pycache__/helpers.cpython-311.pyc +0 -0
  12. Unified_parser/__pycache__/helpers.cpython-37.pyc +0 -0
  13. Unified_parser/__pycache__/helpers.cpython-38.pyc +0 -0
  14. Unified_parser/__pycache__/parallelparser.cpython-37.pyc +0 -0
  15. Unified_parser/__pycache__/parallelparser.cpython-38.pyc +0 -0
  16. Unified_parser/__pycache__/uparser.cpython-310.pyc +0 -0
  17. Unified_parser/__pycache__/uparser.cpython-37.pyc +0 -0
  18. Unified_parser/common.map +128 -0
  19. Unified_parser/common_hindi.map +128 -0
  20. Unified_parser/common_telugu.map +128 -0
  21. Unified_parser/dict/english.dict +0 -0
  22. Unified_parser/dict/english.dict_old +1 -0
  23. Unified_parser/dict/hindi.dict1 +1 -0
  24. Unified_parser/dict/malayalam.dict +1 -0
  25. Unified_parser/extract_words.py +33 -0
  26. Unified_parser/get_phone_mapped_python.py +76 -0
  27. Unified_parser/globals.py +71 -0
  28. Unified_parser/helpers.py +1031 -0
  29. Unified_parser/ply/__init__.py +5 -0
  30. Unified_parser/ply/__pycache__/__init__.cpython-310.pyc +0 -0
  31. Unified_parser/ply/__pycache__/__init__.cpython-311.pyc +0 -0
  32. Unified_parser/ply/__pycache__/__init__.cpython-37.pyc +0 -0
  33. Unified_parser/ply/__pycache__/__init__.cpython-38.pyc +0 -0
  34. Unified_parser/ply/__pycache__/lex.cpython-310.pyc +0 -0
  35. Unified_parser/ply/__pycache__/lex.cpython-311.pyc +0 -0
  36. Unified_parser/ply/__pycache__/lex.cpython-37.pyc +0 -0
  37. Unified_parser/ply/__pycache__/lex.cpython-38.pyc +0 -0
  38. Unified_parser/ply/__pycache__/yacc.cpython-310.pyc +0 -0
  39. Unified_parser/ply/__pycache__/yacc.cpython-311.pyc +0 -0
  40. Unified_parser/ply/__pycache__/yacc.cpython-37.pyc +0 -0
  41. Unified_parser/ply/__pycache__/yacc.cpython-38.pyc +0 -0
  42. Unified_parser/ply/lex.py +110 -0
  43. Unified_parser/ply/yacc.py +0 -0
  44. Unified_parser/punjabi/extract_punjabi.py +15 -0
  45. Unified_parser/punjabi/punjabi_asr_sample +0 -0
  46. Unified_parser/punjabi/punjabi_results.txt +0 -0
  47. Unified_parser/punjabi/punjabi_words.txt +0 -0
  48. Unified_parser/punjabi/runner_punjabi.py +13 -0
  49. Unified_parser/pypi_package/LICENSE +21 -0
  50. Unified_parser/pypi_package/README.md +34 -0
Unified_parser/.vscode/tasks.json ADDED
@@ -0,0 +1,28 @@
+ {
+     "tasks": [
+         {
+             "type": "cppbuild",
+             "label": "C/C++: gcc build active file",
+             "command": "/usr/bin/gcc",
+             "args": [
+                 "-fdiagnostics-color=always",
+                 "-g",
+                 "${file}",
+                 "-o",
+                 "${fileDirname}/${fileBasenameNoExtension}"
+             ],
+             "options": {
+                 "cwd": "${fileDirname}"
+             },
+             "problemMatcher": [
+                 "$gcc"
+             ],
+             "group": {
+                 "kind": "build",
+                 "isDefault": true
+             },
+             "detail": "Task generated by Debugger."
+         }
+     ],
+     "version": "2.0.0"
+ }
Unified_parser/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2022 vikram-kv
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
Unified_parser/README.md ADDED
@@ -0,0 +1,34 @@
+ # Python_Unified_Parser
+
+ This parser attempts to unify the languages based on the Common Label Set. It is designed across all the languages capitalising on the syllable structure of Indian languages. The Unified Parser converts UTF-8 text to common label set, applies letter-to-sound rules and generates the corresponding phoneme sequences. The effort is a step towards natural language understanding system that operates on Indian languages and generates the parsed output. This structured method requires only knowledge of the basic language. With good lexicons it is possible to get more than 95% correctness of words in a language. This method can be further extended for a number of other Indian languages in minimal time and effort. Given the unity in the diversity of Indian languages, developing parsers for new languages is easy using the unified approach.
+
+ Our python parser - [uparser.py](src/indic-unified-parser/uparser.py) - Combines lex and yacc functionality in a single python script using the [PLY](src/indic-unified-parser/ply) framework.
+
+ ## Publications
+ [Baby, Arun, et al. "A unified parser for developing Indian language text to speech synthesizers." Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings 19. Springer International Publishing, 2016.](https://www.iitm.ac.in/donlab/tts/downloads/unified/unified.pdf)
+
+ ## Installation
+
+ ```bash
+ pip install indic_unified_parser
+ ```
+
+ ## How to use
+
+ ```bash
+ from indic_unified_parser.uparser import wordparse
+ parsed_output_string = wordparse(<word : str>, <lsflag : int>, <wfflag : int>, <clearflag : int>)
+ ```
+
+ 1. `lsflag`: always 0. Deprecated.
+ 2. `wfflag`: 0 for Monophone parsing, 1 for syllable parsing, 2 for Akshara Parsing"
+ 3. `clearflag`: 1 for removing the lisp like format of output and to just produce space separated output. Otherwise, 0.
+
+ ## Examples
+
+ check run_parser_all_lang_all_opt.py file for the use of wordparse function.
+
+
+
+ ## URLS
+ [Homepage](https://github.com/vikram-kv/Unified_Parser)
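The `wordparse` call documented in the README above takes three integer flags that are easy to mix up. A minimal sketch of a thin wrapper follows, assuming only the signature and flag values stated in the README. The names `ParseMode`, `parse_word`, and the `parser` argument are hypothetical and introduced here for illustration; they are not part of the package.

```python
from enum import IntEnum


class ParseMode(IntEnum):
    """Values of `wfflag` as listed in the README."""
    MONOPHONE = 0
    SYLLABLE = 1
    AKSHARA = 2


def parse_word(word, mode=ParseMode.MONOPHONE, plain=True, parser=None):
    """Call wordparse(word, lsflag, wfflag, clearflag) with named options.

    `parser` is a hypothetical hook so the wrapper can be exercised without
    the package installed; by default the real `wordparse` is imported.
    """
    if parser is None:
        # Import path as given in the README.
        from indic_unified_parser.uparser import wordparse as parser
    lsflag = 0                      # README: always 0 (deprecated)
    clearflag = 1 if plain else 0   # README: 1 gives space-separated output
    return parser(word, lsflag, int(mode), clearflag)
```

With the package installed, `parse_word(some_word, ParseMode.SYLLABLE)` would forward to `wordparse(some_word, 0, 1, 1)`.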
Unified_parser/__pycache__/get_phone_mapped_python.cpython-311.pyc ADDED
Binary file (2.75 kB).
Unified_parser/__pycache__/get_phone_mapped_python.cpython-37.pyc ADDED
Binary file (1.63 kB).
Unified_parser/__pycache__/globals.cpython-310.pyc ADDED
Binary file (3.01 kB).
Unified_parser/__pycache__/globals.cpython-311.pyc ADDED
Binary file (4.65 kB).
Unified_parser/__pycache__/globals.cpython-37.pyc ADDED
Binary file (3.14 kB).
Unified_parser/__pycache__/globals.cpython-38.pyc ADDED
Binary file (3.15 kB).
Unified_parser/__pycache__/helpers.cpython-310.pyc ADDED
Binary file (21 kB).
Unified_parser/__pycache__/helpers.cpython-311.pyc ADDED
Binary file (45.8 kB).
Unified_parser/__pycache__/helpers.cpython-37.pyc ADDED
Binary file (24.3 kB).
Unified_parser/__pycache__/helpers.cpython-38.pyc ADDED
Binary file (22.8 kB).
Unified_parser/__pycache__/parallelparser.cpython-37.pyc ADDED
Binary file (5.17 kB).
Unified_parser/__pycache__/parallelparser.cpython-38.pyc ADDED
Binary file (5.33 kB).
Unified_parser/__pycache__/uparser.cpython-310.pyc ADDED
Binary file (5.22 kB).
Unified_parser/__pycache__/uparser.cpython-37.pyc ADDED
Binary file (6.19 kB).
Unified_parser/common.map ADDED
@@ -0,0 +1,128 @@
1
+ 0 mq q q q q ऀ mq mq mq mq M
2
+ 1 mq q q q q ँ ঁ ઁ ଁ ਁ
3
+ 2 q ം ஂ ం ಂ ं ং ં ଂ ਂ
4
+ 3 h ഃ ஃ ః ಃ ः ঃ ઃ ଃ ਃ H
5
+ 4 a a a a a ऄ a a a a
6
+ 5 a അ அ అ ಅ अ অ અ ଅ ਅ
7
+ 6 aa ആ ஆ ఆ ಆ आ আ આ ଆ ਆ A
8
+ 7 i ഇ இ ఇ ಇ इ ই ઇ ଇ ਇ
9
+ 8 ii ഈ ஈ ఈ ಈ ई ঈ ઈ ଈ ਈ I
10
+ 9 u ഉ உ ఉ ಉ उ উ ઉ ଉ ਉ
11
+ 10 uu ഊ ஊ ఊ ಊ ऊ ঊ ઊ ଊ ਊ U
12
+ 11 rq ഋ rx ఋ ಋ ऋ ঋ ઋ ଋ r R
13
+ 12 uu uu uu uu uu ऌ ঌ ઌ ଌ uu
14
+ 13 ae e e e e ऍ ee ઍ ee ee ऍ
15
+ 14 e എ எ ఎ ಎ ऎ ee ee ee ee
16
+ 15 ee ഏ ஏ ఏ ಏ ए এ એ ଏ ਏ E
17
+ 16 ei ഐ ஐ ఐ ಐ ऐ ঐ ઐ ଐ ਐ ऐ
18
+ 17 ax a a a a ऑ a ઑ a a ऑ
19
+ 18 o ഒ ஒ ఒ ಒ ऒ oo oo oo oo
20
+ 19 oo ഓ ஓ ఓ ಓ ओ ও ઓ ଓ ਓ O
21
+ 20 ou ഔ ஔ ఔ ಔ औ ঔ ઔ ଔ ਔ औ
22
+ 21 k ക க క ಕ क ক ક କ ਕ
23
+ 22 kh ഖ k ఖ ಖ ख খ ખ ଖ ਖ ख
24
+ 23 g ഗ k గ ಗ ग গ ગ ଗ ਗ
25
+ 24 gh ഘ k ఘ ಘ घ ঘ ઘ ଘ ਘ घ
26
+ 25 ng ങ ங ఙ ಙ ङ ঙ ઙ ଙ ਙ ङ
27
+ 26 c ച ச చ ಚ च চ ચ ଚ ਚ
28
+ 27 ch ഛ c ఛ ಛ छ ছ છ ଛ ਛ C
29
+ 28 j ജ ஜ జ ಜ ज জ જ ଜ ਜ
30
+ 29 jh ഝ j ఝ ಝ झ ঝ ઝ ଝ ਝ J
31
+ 30 nj ഞ ஞ ఞ ಞ ञ ঞ ઞ ଞ ਞ ञ
32
+ 31 tx ട ட ట ಟ ट ট ટ ଟ ਟ ट
33
+ 32 txh ഠ tx ఠ ಠ ठ ঠ ઠ ଠ ਠ ठ
34
+ 33 dx ഡ tx డ ಡ ड ড ડ ଡ ਡ ड
35
+ 34 dxh ഢ tx ఢ ಢ ढ ঢ ઢ ଢ ਢ ढ
36
+ 35 nx ണ ண ణ ಣ ण ণ ણ ଣ ਣ ण
37
+ 36 t ത த త ತ त ত ત ତ ਤ
38
+ 37 th ഥ t థ ಥ थ থ થ ଥ ਥ थ
39
+ 38 d ദ t ద ದ द দ દ ଦ ਦ
40
+ 39 dh ധ t ధ ಧ ध ধ ધ ଧ ਧ ध
41
+ 40 n ന ந న ನ न ন ન ନ ਨ
42
+ 41 nd ഩ ன n n ऩ n n n n न
43
+ 42 p പ ப ప ಪ प প પ ପ ਪ
44
+ 43 ph ഫ p ఫ ಫ फ ফ ફ ଫ ਫ P
45
+ 44 b ബ p బ ಬ ब ব બ ବ ਬ
46
+ 45 bh ഭ p భ ಭ भ ভ ભ ଭ ਭ B
47
+ 46 m മ ம మ ಮ म ম મ ମ ਮ
48
+ 47 y യ ய య ಯ य য ય ୟ ਯ
49
+ 48 r ര ர ర ರ र র ર ର ਰ
50
+ 49 rx റ ற r r ऱ r r r r र
51
+ 50 l ല ல ల ಲ ल ল લ ଲ ਲ
52
+ 51 lx ള ள ళ ಳ ळ l ળ ଳ ਲ਼ ള
53
+ 52 zh ഴ ழ lx lx ऴ lx lx lx lx Z
54
+ 53 w വ வ వ ವ व b વ ଵ ਵ
55
+ 54 sh ശ ஶ శ ಶ श শ શ ଶ ਸ਼ श
56
+ 55 sx ഷ ஷ ష ಷ ष ষ ષ ଷ # ष
57
+ 56 s സ ஸ స ಸ स স સ ସ ਸ
58
+ 57 h ഹ ஹ హ ಹ ह হ હ ହ ਹ
59
+ 58 a a a a a ऺ a a a a
60
+ 59 aav aav aav aav aav ऻ aav aav aav aav
61
+ 60 nk a a a a ़ ় ઼ ଼ ਼ Y
62
+ 61 ag a a a a ऽ ঽ ઽ ଽ # ऽ
63
+ 62 aav ാ ா ా ಾ ा া ા ା ਾ
64
+ 63 iv ി ி ి ಿ ि ি િ ି ਿ
65
+ 64 iiv ീ ீ ీ ೀ ी ী ી ୀ ੀ
66
+ 65 uv ു ு ు ು ु ু ુ ୁ ੁ
67
+ 66 uuv ൂ ூ ూ ೂ ू ূ ૂ ୂ ੂ
68
+ 67 rqv ൃ uv ృ ೃ ृ ৃ ૃ ୃ uv
69
+ 68 rqwv ൄ uuv ౄ ೄ ॄ ৄ ૄ rqv uuv ॠ
70
+ 69 aev ev ev ev ev ॅ eev eev eev eev
71
+ 70 ev െ ெ ె ೆೆ ॆ eev eev ୄ eev
72
+ 71 eev േ ே ే ೇ े ে ે େ ੇ
73
+ 72 eiv ൈ ை ై ೇೈ ै ৈ ૈ ୈ ੈ ऐ
74
+ 73 axv aav aav aav aav ॉ aav ૉ aav aav ऑ
75
+ 74 ov ൊ ொ ొ ೊ ॊ oov oov oov oov
76
+ 75 oov ോ ோ ో ೋ ो ো ો ୋ ੋ O
77
+ 76 ouv ൌ ௌ ౌ ೌ ौ ৌ ૌ ୌ ੌ औ
78
+ 77 eu ് ் ్ ್ ् ্ ્ ୍ ੍ உ
79
+ 78 tv a a a a ॎ ৎ a a a
80
+ 79 $ ouv ouv ouv ouv ॏ ouv ouv ouv ouv
81
+ 80 $ # # # # ॐ # ૐ # ੴ
82
+ 81 $ a a a a ॓ a a a a
83
+ 82 $ a a a a ॔ a a a a
84
+ 83 $ # # # # # # # # #
85
+ 84 $ # # # # # # # # #
86
+ 85 aav a a a a ॕ a a a a
87
+ 86 aav a a a a ॖ a a ୖ a
88
+ 87 auv ൗ a a a ॗ ৗ a ୗ a औ
89
+ 88 kq k k k k क़ k k k k क
90
+ 89 khq kh kh kh kh ख़ kh kh kh ਖ਼ K
91
+ 90 gq g g g g ग़ g g g ਗ਼ G
92
+ 91 z j j j j ज़ j j j ਜ਼
93
+ 92 dxq dx dx dx dx ड़ ড় dx ଡ଼ ੜ D
94
+ 93 dxhq dxh dxh dxh dxh ढ़ ঢ় dxh ଢ଼ dxh T
95
+ 94 f f f f f फ़ f f f ਫ਼
96
+ 95 y y y y y य़ য় y ୟ y
97
+ 96 rqw ൠ ற ౠ ೠ ॠ ৠ ૠ ୠ r ॠ
98
+ 97 $ # # # # ॡ ৡ ૡ ୡ #
99
+ 98 $ # # # # ॢ ৢ ૢ # #
100
+ 99 $ # # # # ॣ ৣ ૣ ୢ #
101
+ 100 $ # # # # । # # # #
102
+ 101 $ # # # # ॥ # # ୣ #
103
+ 102 0 ൦ ௦ ౦ ೦ ० ০ ૦ ୦ ੦
104
+ 103 1 ൧ ௧ ౧ ೧ १ ১ ૧ ୧ ੧
105
+ 104 2 ൨ ௨ ౨ ೨ २ ২ ૨ ୨ ੨
106
+ 105 3 ൩ ௩ ౩ ೩ ३ ৩ ૩ ୩ ੩
107
+ 106 4 ൪ ௪ ౪ ೪ ४ ৪ ૪ ୪ ੪
108
+ 107 5 ൫ ௫ ౫ ೫ ५ ৫ ૫ ୫ ੫
109
+ 108 6 ൬ ௬ ౬ ೬ ६ ৬ ૬ ୬ ੬
110
+ 109 7 ൭ ௭ ౭ ೭ ७ ৭ ૭ ୭ ੭
111
+ 110 8 ൮ ௮ ౮ ೮ ८ ৮ ૮ ୮ ੮
112
+ 111 9 ൯ ௯ ౯ ೯ ९ ৯ ૯ ୯ ੯
113
+ 112 rv r r r r ॰ ৰ ૰ ୰ r
114
+ 113 wv w w w w ॱ ৱ ૱ ୱ w W
115
+ 114 $ a a a a ॲ ৲ a ୲ a
116
+ 115 $ a a a a ॳ ৳ a ୳ a
117
+ 116 $ aa aa aa aa ॴ ৴ aa ୴ aa
118
+ 117 $ ou ou ou ou ॵ ৵ ou ୵ ou
119
+ 118 $ a a a a ॶ ৶ a ୶ a
120
+ 119 $ a a a a ॷ ৷ a ୷ a
121
+ 120 $ dx dx dx dx ॸ ৸ dx dx dx
122
+ 121 $ j j j j ॹ ৹ z z z
123
+ 122 nwv ൺ nx nx nx ॺ ৺ y y y ൺ
124
+ 123 nnv ൻ n n n ॻ ৻ g g g N
125
+ 124 rwv ർ rx rx rx ॼ j j j j ർ
126
+ 125 lwv ൽ l l l ॽ sp sp sp sp ൽ
127
+ 126 lnv ൾ l l l ॾ dx dx dx dx ൾ
128
+ 127 $ b b b b ॿ b b b b
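Each row of `common.map` appears to hold an index, a common label, and one entry per language. `SetlanguageFeat` in `helpers.py` splits rows on tabs and reads the label from field 1 and the language entry from field `1 + langId`. A minimal sketch of reading one row under that assumption follows. `parse_map_row` and the `extra` key are hypothetical names, and the meaning of the optional trailing field is not documented in this commit.

```python
# Column order assumed from langIdNameMapping in helpers.py (ids 1..9).
LANGUAGES = ["malayalam", "tamil", "telugu", "kannada", "hindi",
             "bengali", "gujarathi", "odiya", "punjabi"]


def parse_map_row(line):
    """Split one tab-separated map row into a dictionary."""
    fields = line.strip().split("\t")
    row = {"index": int(fields[0]), "label": fields[1]}
    for offset, name in enumerate(LANGUAGES):
        row[name] = fields[2 + offset]
    # Some rows carry one more field after the nine language columns.
    row["extra"] = fields[11] if len(fields) > 11 else None
    return row


# Row 6 of common.map, rebuilt with explicit tabs.
sample = "\t".join(["6", "aa", "ആ", "ஆ", "ఆ", "ಆ", "आ", "আ", "આ", "ଆ", "ਆ", "A"])
row = parse_map_row(sample)
print(row["label"], row["hindi"], row["extra"])
```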
Unified_parser/common_hindi.map ADDED
@@ -0,0 +1,128 @@
1
+ 0 mq q q q q ऀ mq mq mq mq M
2
+ 1 mq q q q q ँ ঁ ઁ ଁ ਁ
3
+ 2 q ം ஂ m ಂ ं ং ં ଂ ਂ
4
+ 3 hq ഃ ஃ ః ಃ ः ঃ ઃ ଃ ਃ H
5
+ 4 a a a aa a ऄ a a a a
6
+ 5 a അ அ అ ಅ o অ અ ଅ ਅ
7
+ 6 aa ആ ஆ ఆ ಆ आ আ આ ଆ ਆ A
8
+ 7 i ഇ இ ఇ ಇ इ ই ઇ ଇ ਇ
9
+ 8 ii ഈ ஈ ఈ ಈ ई ঈ ઈ ଈ ਈ I
10
+ 9 u ഉ உ ఉ ಉ उ উ ઉ ଉ ਉ
11
+ 10 uu ഊ ஊ ఊ ಊ ऊ ঊ ઊ ଊ ਊ U
12
+ 11 rq ഋ rx ఋ ಋ ऋ ঋ ઋ ଋ r R
13
+ 12 uu uu uu uu uu l&i ঌ ઌ ଌ uu
14
+ 13 ae e e e e ऍ ee ઍ ee ee ऍ
15
+ 14 e എ எ ఎ ಎ ऎ ee ee ee ee
16
+ 15 ee ഏ ஏ ఏ ಏ ए এ એ ଏ ਏ E
17
+ 16 ei ഐ ஐ ఐ ಐ o&i ঐ ઐ ଐ ਐ ऐ
18
+ 17 ax a a a a ऑ a ઑ a a ऑ
19
+ 18 o ഒ ஒ ఒ ಒ ऒ oo oo oo oo
20
+ 19 oo ഓ ஓ ఓ ಓ ओ ও ઓ ଓ ਓ O
21
+ 20 ou ഔ ஔ ఔ ಔ औ ঔ ઔ ଔ ਔ औ
22
+ 21 k ക க క ಕ क ক ક କ ਕ
23
+ 22 kh ഖ k ఖ ಖ ख খ ખ ଖ ਖ ख
24
+ 23 g ഗ k గ ಗ ग গ ગ ଗ ਗ
25
+ 24 gh ഘ k ఘ ಘ घ ঘ ઘ ଘ ਘ घ
26
+ 25 ng ങ ங ఙ ಙ ङ ঙ ઙ ଙ ਙ ङ
27
+ 26 c ച ச చ ಚ च চ ચ ଚ ਚ
28
+ 27 ch ഛ c ఛ ಛ छ ছ છ ଛ ਛ C
29
+ 28 j ജ ஜ జ ಜ ज জ જ ଜ ਜ
30
+ 29 jh ഝ j ఝ ಝ झ ঝ ઝ ଝ ਝ J
31
+ 30 nj ഞ ஞ ఞ ಞ ञ ঞ ઞ ଞ ਞ ञ
32
+ 31 tx ട ட ట ಟ ट ট ટ ଟ ਟ ट
33
+ 32 txh ഠ tx ఠ ಠ ठ ঠ ઠ ଠ ਠ ठ
34
+ 33 dx ഡ tx డ ಡ ड ড ડ ଡ ਡ ड
35
+ 34 dxh ഢ tx ఢ ಢ ढ ঢ ઢ ଢ ਢ ढ
36
+ 35 nx ണ ண ణ ಣ ण ণ ણ ଣ ਣ ण
37
+ 36 t ത த త ತ त ত ત ତ ਤ
38
+ 37 th ഥ t థ ಥ थ থ થ ଥ ਥ थ
39
+ 38 d ദ t ద ದ द দ દ ଦ ਦ
40
+ 39 dh ധ t ధ ಧ ध ধ ધ ଧ ਧ ध
41
+ 40 n ന ந న ನ न ন ન ନ ਨ
42
+ 41 nd ഩ ன n n ऩ n n n n न
43
+ 42 p പ ப ప ಪ प প પ ପ ਪ
44
+ 43 ph ഫ p ఫ ಫ फ ফ ફ ଫ ਫ P
45
+ 44 b ബ p బ ಬ ब ব બ ବ ਬ
46
+ 45 bh ഭ p భ ಭ भ ভ ભ ଭ ਭ B
47
+ 46 m മ ம మ ಮ म ম મ ମ ਮ
48
+ 47 y യ ய య ಯ j য ય ୟ ਯ
49
+ 48 r ര ர ర ರ र র ર ର ਰ
50
+ 49 rx റ ற r r ऱ r r r r र
51
+ 50 l ല ல ల ಲ ल ল લ ଲ ਲ
52
+ 51 lx ള ள ళ ಳ ळ l ળ ଳ ਲ਼ ള
53
+ 52 zh ഴ ழ lx lx ऴ lx lx lx lx Z
54
+ 53 w വ வ వ ವ व b વ ଵ ਵ
55
+ 54 sh ശ ஶ sx ಶ श শ શ ଶ ਸ਼ श
56
+ 55 sx ഷ ஷ ష ಷ ष ষ ષ ଷ # ष
57
+ 56 s സ ஸ స ಸ स স સ ସ ਸ
58
+ 57 h ഹ ஹ హ ಹ ह হ હ ହ ਹ
59
+ 58 a a a a a ऺ a a a a
60
+ 59 aav aav aav aav aav ऻ aav aav aav aav
61
+ 60 nk a a a a ़ ় ઼ ଼ ਼ Y
62
+ 61 ag a a a a ऽ ঽ ઽ ଽ # ऽ
63
+ 62 aav ാ ா ా ಾ ा া ા ା ਾ
64
+ 63 iv ി ி ి ಿ ि ি િ ି ਿ
65
+ 64 iiv ീ ீ ీ ೀ ी ী ી ୀ ੀ
66
+ 65 uv ു ு ు ು ु ু ુ ୁ ੁ
67
+ 66 uuv ൂ ூ ూ ೂ ू ূ ૂ ୂ ੂ
68
+ 67 rqv ൃ uv ృ ೃ ृ ৃ ૃ ୃ uv
69
+ 68 rqwv ൄ uuv ౄ ೄ ॄ ৄ ૄ rqv uuv ॠ
70
+ 69 aev ev ev ev ev ॅ eev eev eev eev
71
+ 70 ev െ ெ ె ೆೆ ॆ eev eev ୄ eev
72
+ 71 eev േ ே ే ೇ े ে ે େ ੇ
73
+ 72 eiv ൈ ை ై ೇೈ ै ৈ ૈ ୈ ੈ ऐ
74
+ 73 axv aav aav aav aav ॉ aav ૉ aav aav ऑ
75
+ 74 ov ൊ ொ ొ ೊ ॊ oov oov oov oov
76
+ 75 oov ോ ோ ో ೋ ो ো ો ୋ ੋ O
77
+ 76 ouv ൌ ௌ ౌ ೌ ौ ৌ ૌ ୌ ੌ औ
78
+ 77 eu ് ் ్ ್ ् ্ ્ ୍ ੍ உ
79
+ 78 tv a a a a ॎ ৎ a a a
80
+ 79 $ ouv ouv ouv ouv ॏ ouv ouv ouv ouv
81
+ 80 $ # # # # ॐ o&u&m ૐ # ੴ
82
+ 81 $ a a a a ॓ a a a a
83
+ 82 $ a a a a ॔ a a a a
84
+ 83 $ # # # # # # # # #
85
+ 84 $ # # # # # # # # #
86
+ 85 aav a a a a ॕ a a a a
87
+ 86 aav a a a a ॖ a a ୖ a
88
+ 87 auv ൗ a a a ॗ ৗ a ୗ a औ
89
+ 88 kq k k k k क़ k k k k क
90
+ 89 khq kh kh kh kh ख़ kh kh kh ਖ਼ K
91
+ 90 gq g g g g ग़ g g g ਗ਼ G
92
+ 91 z j j j j ज़ j j j ਜ਼
93
+ 92 dxq dx dx dx dx ड़ rx dx ଡ଼ ੜ D
94
+ 93 dxhq dxh dxh dxh dxh ढ़ ঢ় dxh ଢ଼ dxh T
95
+ 94 f f f f f फ़ f f f ਫ਼
96
+ 95 y y y y y य़ য় y ୟ y
97
+ 96 rqw ൠ ற ౠ ೠ ॠ ৠ ૠ ୠ r ॠ
98
+ 97 $ # # # # ॡ ৡ ૡ ୡ #
99
+ 98 $ # # # # ॢ ৢ ૢ # #
100
+ 99 $ # # # # ॣ ৣ ૣ ୢ #
101
+ 100 $ # # # # । # # # #
102
+ 101 $ # # # # ॥ # # ୣ #
103
+ 102 0 ൦ ௦ ౦ ೦ ० ০ ૦ ୦ ੦
104
+ 103 1 ൧ ௧ ౧ ೧ १ ১ ૧ ୧ ੧
105
+ 104 2 ൨ ௨ ౨ ೨ २ ২ ૨ ୨ ੨
106
+ 105 3 ൩ ௩ ౩ ೩ ३ ৩ ૩ ୩ ੩
107
+ 106 4 ൪ ௪ ౪ ೪ ४ ৪ ૪ ୪ ੪
108
+ 107 5 ൫ ௫ ౫ ೫ ५ ৫ ૫ ୫ ੫
109
+ 108 6 ൬ ௬ ౬ ೬ ६ ৬ ૬ ୬ ੬
110
+ 109 7 ൭ ௭ ౭ ೭ ७ ৭ ૭ ୭ ੭
111
+ 110 8 ൮ ௮ ౮ ೮ ८ ৮ ૮ ୮ ੮
112
+ 111 9 ൯ ௯ ౯ ೯ ९ ৯ ૯ ୯ ੯
113
+ 112 rv r r r r ॰ ৰ ૰ ୰ r
114
+ 113 wv w w w w ॱ ৱ ૱ ୱ w W
115
+ 114 $ a a a a ॲ ৲ a ୲ a
116
+ 115 $ a a a a ॳ ৳ a ୳ a
117
+ 116 $ aa aa aa aa ॴ ৴ aa ୴ aa
118
+ 117 $ ou ou ou ou ॵ ৵ ou ୵ ou
119
+ 118 $ a a a a ॶ ৶ a ୶ a
120
+ 119 $ a a a a ॷ ৷ a ୷ a
121
+ 120 $ dx dx dx dx ॸ ৸ dx dx dx
122
+ 121 $ j j j j ॹ ৹ z z z
123
+ 122 nwv ൺ nx nx nx ॺ ৺ y y y ൺ
124
+ 123 nnv ൻ n n n ॻ ৻ g g g N
125
+ 124 rwv ർ rx rx rx ॼ j j j j ർ
126
+ 125 lwv ൽ l l l ॽ sp sp sp sp ൽ
127
+ 126 lnv ൾ l l l ॾ dx dx dx dx ൾ
128
+ 127 $ b b b b ॿ b b b b
Unified_parser/common_telugu.map ADDED
@@ -0,0 +1,128 @@
1
+ 0 mq q q q q ऀ mq mq mq mq M
2
+ 1 mq q q q q ँ ঁ ઁ ଁ ਁ
3
+ 2 q ം ஂ ం ಂ ं ং ં ଂ ਂ
4
+ 3 hq ഃ ஃ ః ಃ ः ঃ ઃ ଃ ਃ H
5
+ 4 a a a a a ऄ a a a a
6
+ 5 a അ அ అ ಅ अ অ અ ଅ ਅ
7
+ 6 aa ആ ஆ ఆ ಆ आ আ આ ଆ ਆ A
8
+ 7 i ഇ இ ఇ ಇ इ ই ઇ ଇ ਇ
9
+ 8 ii ഈ ஈ ఈ ಈ ई ঈ ઈ ଈ ਈ I
10
+ 9 u ഉ உ ఉ ಉ उ উ ઉ ଉ ਉ
11
+ 10 uu ഊ ஊ ఊ ಊ ऊ ঊ ઊ ଊ ਊ U
12
+ 11 rq ഋ rx ఋ ಋ ऋ ঋ ઋ ଋ r R
13
+ 12 uu uu uu uu uu ऌ ঌ ઌ ଌ uu
14
+ 13 ae e e e e ऍ ee ઍ ee ee ऍ
15
+ 14 e എ எ ఎ ಎ ऎ ee ee ee ee
16
+ 15 ee ഏ ஏ ఏ ಏ ए এ એ ଏ ਏ E
17
+ 16 ei ഐ ஐ ee ಐ ऐ ঐ ઐ ଐ ਐ ऐ
18
+ 17 ax a a a a ऑ a ઑ a a ऑ
19
+ 18 o ഒ ஒ ఒ ಒ ऒ oo oo oo oo
20
+ 19 oo ഓ ஓ ఓ ಓ ओ ও ઓ ଓ ਓ O
21
+ 20 ou ഔ ஔ ఔ ಔ औ ঔ ઔ ଔ ਔ औ
22
+ 21 k ക க క ಕ क ক ક କ ਕ
23
+ 22 kh ഖ k ఖ ಖ ख খ ખ ଖ ਖ ख
24
+ 23 g ഗ k గ ಗ ग গ ગ ଗ ਗ
25
+ 24 gh ഘ k ఘ ಘ घ ঘ ઘ ଘ ਘ घ
26
+ 25 ng ങ ங ఙ ಙ ङ ঙ ઙ ଙ ਙ ङ
27
+ 26 c ച ச చ ಚ च চ ચ ଚ ਚ
28
+ 27 ch ഛ c ఛ ಛ छ ছ છ ଛ ਛ C
29
+ 28 j ജ ஜ జ ಜ ज জ જ ଜ ਜ
30
+ 29 jh ഝ j ఝ ಝ झ ঝ ઝ ଝ ਝ J
31
+ 30 nj ഞ ஞ ఞ ಞ ञ ঞ ઞ ଞ ਞ ञ
32
+ 31 tx ട ட ట ಟ ट ট ટ ଟ ਟ ट
33
+ 32 txh ഠ tx ఠ ಠ ठ ঠ ઠ ଠ ਠ ठ
34
+ 33 dx ഡ tx డ ಡ ड ড ડ ଡ ਡ ड
35
+ 34 dxh ഢ tx ఢ ಢ ढ ঢ ઢ ଢ ਢ ढ
36
+ 35 nx ണ ண ణ ಣ ण ণ ણ ଣ ਣ ण
37
+ 36 t ത த త ತ त ত ત ତ ਤ
38
+ 37 th ഥ t థ ಥ थ থ થ ଥ ਥ थ
39
+ 38 d ദ t ద ದ द দ દ ଦ ਦ
40
+ 39 dh ധ t ధ ಧ ध ধ ધ ଧ ਧ ध
41
+ 40 n ന ந న ನ न ন ન ନ ਨ
42
+ 41 nd ഩ ன n n ऩ n n n n न
43
+ 42 p പ ப ప ಪ प প પ ପ ਪ
44
+ 43 ph ഫ p ఫ ಫ फ ফ ફ ଫ ਫ P
45
+ 44 b ബ p బ ಬ ब ব બ ବ ਬ
46
+ 45 bh ഭ p భ ಭ भ ভ ભ ଭ ਭ B
47
+ 46 m മ ம మ ಮ म ম મ ମ ਮ
48
+ 47 y യ ய య ಯ य য ય ୟ ਯ
49
+ 48 r ര ர ర ರ र র ર ର ਰ
50
+ 49 rx റ ற r r ऱ r r r r र
51
+ 50 l ല ல ల ಲ ल ল લ ଲ ਲ
52
+ 51 lx ള ள ళ ಳ ळ l ળ ଳ ਲ਼ ള
53
+ 52 zh ഴ ழ lx lx ऴ lx lx lx lx Z
54
+ 53 w വ வ వ ವ व b વ ଵ ਵ
55
+ 54 sh ശ ஶ శ ಶ श শ શ ଶ ਸ਼ श
56
+ 55 sx ഷ ஷ ష ಷ ष ষ ષ ଷ # ष
57
+ 56 s സ ஸ స ಸ स স સ ସ ਸ
58
+ 57 h ഹ ஹ హ ಹ ह হ હ ହ ਹ
59
+ 58 a a a a a ऺ a a a a
60
+ 59 aav aav aav aav aav ऻ aav aav aav aav
61
+ 60 nk a a a a ़ ় ઼ ଼ ਼ Y
62
+ 61 ag a a a a ऽ ঽ ઽ ଽ # ऽ
63
+ 62 aav ാ ா ా ಾ ा া ા ା ਾ
64
+ 63 iv ി ி ి ಿ ि ি િ ି ਿ
65
+ 64 iiv ീ ீ ీ ೀ ी ী ી ୀ ੀ
66
+ 65 uv ു ு ు ು ु ু ુ ୁ ੁ
67
+ 66 uuv ൂ ூ ూ ೂ ू ূ ૂ ୂ ੂ
68
+ 67 rqv ൃ uv ృ ೃ ृ ৃ ૃ ୃ uv
69
+ 68 rqwv ൄ uuv ౄ ೄ ॄ ৄ ૄ rqv uuv ॠ
70
+ 69 aev ev ev ev ev ॅ eev eev eev eev
71
+ 70 ev െ ெ ె ೆೆ ॆ eev eev ୄ eev
72
+ 71 eev േ ே ే ೇ े ে ે େ ੇ
73
+ 72 eiv ൈ ை eev ೇೈ ै ৈ ૈ ୈ ੈ ऐ
74
+ 73 axv aav aav aav aav ॉ aav ૉ aav aav ऑ
75
+ 74 ov ൊ ொ ొ ೊ ॊ oov oov oov oov
76
+ 75 oov ോ ோ ో ೋ ो ো ો ୋ ੋ O
77
+ 76 ouv ൌ ௌ ౌ ೌ ौ ৌ ૌ ୌ ੌ औ
78
+ 77 eu ് ் ్ ್ ् ্ ્ ୍ ੍ உ
79
+ 78 tv a a a a ॎ ৎ a a a
80
+ 79 $ ouv ouv ouv ouv ॏ ouv ouv ouv ouv
81
+ 80 $ # # # # ॐ # ૐ # ੴ
82
+ 81 $ a a a a ॓ a a a a
83
+ 82 $ a a a a ॔ a a a a
84
+ 83 $ # # # # # # # # #
85
+ 84 $ # # # # # # # # #
86
+ 85 aav a a a a ॕ a a a a
87
+ 86 aav a a a a ॖ a a ୖ a
88
+ 87 auv ൗ a a a ॗ ৗ a ୗ a औ
89
+ 88 kq k k k k क़ k k k k क
90
+ 89 khq kh kh kh kh ख़ kh kh kh ਖ਼ K
91
+ 90 gq g g g g ग़ g g g ਗ਼ G
92
+ 91 z j j j j ज़ j j j ਜ਼
93
+ 92 dxq dx dx dx dx ड़ ড় dx ଡ଼ ੜ D
94
+ 93 dxhq dxh dxh dxh dxh ढ़ ঢ় dxh ଢ଼ dxh T
95
+ 94 f f f f f फ़ f f f ਫ਼
96
+ 95 y y y y y य़ য় y ୟ y
97
+ 96 rqw ൠ ற ౠ ೠ ॠ ৠ ૠ ୠ r ॠ
98
+ 97 $ # # # # ॡ ৡ ૡ ୡ #
99
+ 98 $ # # # # ॢ ৢ ૢ # #
100
+ 99 $ # # # # ॣ ৣ ૣ ୢ #
101
+ 100 $ # # # # । # # # #
102
+ 101 $ # # # # ॥ # # ୣ #
103
+ 102 0 ൦ ௦ ౦ ೦ ० ০ ૦ ୦ ੦
104
+ 103 1 ൧ ௧ ౧ ೧ १ ১ ૧ ୧ ੧
105
+ 104 2 ൨ ௨ ౨ ೨ २ ২ ૨ ୨ ੨
106
+ 105 3 ൩ ௩ ౩ ೩ ३ ৩ ૩ ୩ ੩
107
+ 106 4 ൪ ௪ ౪ ೪ ४ ৪ ૪ ୪ ੪
108
+ 107 5 ൫ ௫ ౫ ೫ ५ ৫ ૫ ୫ ੫
109
+ 108 6 ൬ ௬ ౬ ೬ ६ ৬ ૬ ୬ ੬
110
+ 109 7 ൭ ௭ ౭ ೭ ७ ৭ ૭ ୭ ੭
111
+ 110 8 ൮ ௮ ౮ ೮ ८ ৮ ૮ ୮ ੮
112
+ 111 9 ൯ ௯ ౯ ೯ ९ ৯ ૯ ୯ ੯
113
+ 112 rv r r r r ॰ ৰ ૰ ୰ r
114
+ 113 wv w w w w ॱ ৱ ૱ ୱ w W
115
+ 114 $ a a a a ॲ ৲ a ୲ a
116
+ 115 $ a a a a ॳ ৳ a ୳ a
117
+ 116 $ aa aa aa aa ॴ ৴ aa ୴ aa
118
+ 117 $ ou ou ou ou ॵ ৵ ou ୵ ou
119
+ 118 $ a a a a ॶ ৶ a ୶ a
120
+ 119 $ a a a a ॷ ৷ a ୷ a
121
+ 120 $ dx dx dx dx ॸ ৸ dx dx dx
122
+ 121 $ j j j j ॹ ৹ z z z
123
+ 122 nwv ൺ nx nx nx ॺ ৺ y y y ൺ
124
+ 123 nnv ൻ n n n ॻ ৻ g g g N
125
+ 124 rwv ർ rx rx rx ॼ j j j j ർ
126
+ 125 lwv ൽ l l l ॽ sp sp sp sp ൽ
127
+ 126 lnv ൾ l l l ॾ dx dx dx dx ൾ
128
+ 127 $ b b b b ॿ b b b b
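`ConvertToSymbols` in `helpers.py` indexes these 128-row tables with `ord(ch) % 128`. That lookup relies on each of the nine script blocks starting at a code point that is a multiple of 128 (Devanagari at U+0900, Malayalam at U+0D00, and so on), so the remainder is the offset of the character inside its block. A small sketch of that arithmetic follows; `row_index` and `block_start` are hypothetical helper names.

```python
BLOCK_SIZE = 128  # number of rows in each .map file


def row_index(ch):
    """Offset of a character inside its 128-code-point script block."""
    return ord(ch) % BLOCK_SIZE


def block_start(ch):
    """First code point of the 128-aligned block containing `ch`."""
    return ord(ch) - row_index(ch)


# Devanagari letter A (U+0905) and Malayalam letter KA (U+0D15).
print(row_index("\u0905"), hex(block_start("\u0905")))
print(row_index("\u0d15"), hex(block_start("\u0d15")))
```

These offsets line up with the tables above: row 5 carries the label `a` and row 21 the label `k`.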
Unified_parser/dict/english.dict ADDED
The diff for this file is too large to render.
Unified_parser/dict/english.dict_old ADDED
@@ -0,0 +1 @@
+ english noun ( (( "Ei" "n" "g" ) 0)(( "l" "i" "sh" ) 0) )
Unified_parser/dict/hindi.dict1 ADDED
@@ -0,0 +1 @@
+ अंगारित ( (( "अं" ) 0) (( "गा" ) 0) (( "रित्" ) 0) ) ( (( "a" "q" ) 0) (( "g" "aa" ) 0) (( "r" "i" "t" ) 0) )
Unified_parser/dict/malayalam.dict ADDED
@@ -0,0 +1 @@
+ സ്ത്രീ ( (( "സ്ത്രീ" ) 0) ) ( (( "s" "t" "r" "ii" ) 0) )
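The dictionary entries above store each word as a Lisp-like list of syllables, where every syllable is a parenthesised group of quoted units followed by an integer. A minimal sketch of pulling the units back out with regular expressions follows, assuming every syllable has the shape `(( "x" "y" ) 0)` seen in these three entries. `parse_wordstruct` is a hypothetical helper, not a function from the parser.

```python
import re

# One syllable: (( "unit" "unit" ... ) <integer>)
SYLLABLE_RE = re.compile(r'\(\(\s*((?:"[^"]*"\s*)+)\)\s*\d+\)')
UNIT_RE = re.compile(r'"([^"]*)"')


def parse_wordstruct(text):
    """Return a list of syllables, each a list of quoted units."""
    return [UNIT_RE.findall(m.group(1)) for m in SYLLABLE_RE.finditer(text)]


print(parse_wordstruct('( (( "s" "t" "r" "ii" ) 0) )'))
print(parse_wordstruct('( (( "Ei" "n" "g" ) 0)(( "l" "i" "sh" ) 0) )'))
```

The integer after each group is matched but discarded here, since its meaning is not stated in this commit.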
Unified_parser/extract_words.py ADDED
@@ -0,0 +1,33 @@
+ import os, shutil
+ from uparser import wordparse
+ from joblib import Parallel, delayed
+ from tqdm import tqdm
+
+ num_jobs = 20
+ infolder = 'Original'
+ outfolder = 'Words'
+
+ for fdr in [outfolder]:
+     if os.path.exists(fdr):
+         shutil.rmtree(fdr)
+     os.mkdir(fdr)
+
+ flist = os.listdir(infolder)
+ for fname in flist:
+     with open(f'{infolder}/{fname}', 'r') as f:
+         cnts = f.readlines()
+
+     i = 0
+
+     words = []
+     for l in cnts:
+         l = l.strip().split('\t')
+         words.append(l[0])
+
+     fout = fname.split('_')[1]
+     fout = fout.split('.')[0]
+     print(fout)
+
+     with open(f'{outfolder}/{fout}.words', 'w') as f:
+         for w in words:
+             f.write(w + '\n')
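The two transformations in `extract_words.py` (take the first tab-separated field of each line, and derive the output name from the part of the file name between the first `_` and the next `.`) can be isolated as pure functions. The sketch below assumes input names of the form `<prefix>_<name>.<ext>`. `first_column`, `output_stem`, and the example file name are hypothetical. The script's imports of `wordparse`, `joblib`, and `tqdm` are left out because the code shown does not use them.

```python
def first_column(lines):
    """Return the first tab-separated field of every line."""
    return [line.strip().split("\t")[0] for line in lines]


def output_stem(fname):
    """Map '<prefix>_<name>.<ext>' to '<name>', as the script does."""
    return fname.split("_")[1].split(".")[0]


print(first_column(["alpha\t1\n", "beta\t2\n"]))
print(output_stem("asr_punjabi.txt"))
```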
Unified_parser/get_phone_mapped_python.py ADDED
@@ -0,0 +1,76 @@
+ class TextReplacer:
+     def __init__(self):
+         self.replacements = {
+             'aa':'A',
+             'ae':'ऍ',
+             'ag':'ऽ',
+             'ai':'ऐ',
+             'au':'औ',
+             'axx':'अ',
+             'ax':'ऑ',
+             'bh':'B',
+             'ch':'C',
+             'dh':'ध',
+             'dxhq':'T',
+             'dxh':'ढ',
+             'dxq':'D',
+             'dx':'ड',
+             'ee':'E',
+             'ei':'ऐ',
+             'eu':'உ',
+             'gh':'घ',
+             'gq':'G',
+             'hq':'H',
+             'ii':'I',
+             'jh':'J',
+             'khq':'K',
+             'kh':'ख',
+             'kq':'क',
+             'ln':'ൾ',
+             'lw':'ൽ',
+             'lx':'ള',
+             'mq':'M',
+             'nd':'ऩ',
+             'ng':'ङ',
+             'nj':'ञ',
+             'nk':'Y',
+             'nn':'N',
+             'nw':'ൺ',
+             'nx':'ण',
+             'oo':'O',
+             'ou':'औ',
+             'ph':'P',
+             'rqw':'ॠ',
+             'rq':'R',
+             'rw':'ർ',
+             'rx':'ऱ',
+             'sh':'श',
+             'sx':'ष',
+             'txh':'ठ',
+             'th':'थ',
+             'tx':'ट',
+             'uu':'U',
+             'wv':'W',
+             'zh':'Z'
+
+             # ... Add more replacements as needed
+         }
+
+     def apply_replacements(self, text):
+         for key, value in self.replacements.items():
+             # print('KEY AND VALUE OF PARSED OUTPUT',key, value)
+             text = text.replace(key, value)
+         temp=""
+         for i in range(len(text)):
+             if text[i]!=" ":
+                 temp=temp+text[i]
+
+         return temp
+
+     def apply_replacements_by_phonems(self, text):
+         ans=self.replacements[text]
+         # for key, value in self.replacements.items():
+         #     # print('KEY AND VALUE OF PARSED OUTPUT',key, value)
+         #     text = text.replace(key, value)
+         return ans
+
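`apply_replacements` runs `str.replace` once per key, in dictionary order, over the whole string. A short key can therefore also match inside a longer label that has no entry of its own, for example `aa` inside a label such as `aav` from `common.map`. A sketch of a token-wise alternative follows, assuming the input is a space-separated phone string. `map_phones` is a hypothetical helper and not part of the uploaded file, and the small `table` is a subset copied from `TextReplacer.replacements`.

```python
def map_phones(phones, table):
    """Map each whitespace-separated phone through `table`, then join.

    Phones without an entry are passed through unchanged.
    """
    return "".join(table.get(p, p) for p in phones.split())


table = {"aa": "A", "kh": "ख", "dxh": "ढ", "dx": "ड"}

print(map_phones("kh aa dxh", table))   # whole tokens only
print(map_phones("aav", table))         # no partial match inside 'aav'
print("aav".replace("aa", "A"))         # substring replacement, for contrast
```

Because every token is looked up as a whole, the order of the dictionary keys no longer matters in this variant.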
Unified_parser/globals.py ADDED
@@ -0,0 +1,71 @@
+ # global CONSTANTs for languages. Uses the same values as the enum at
+ # lines 11-13 of unified.y
+
+ import sys, os
+ SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+
+ class FLAGS:
+     DEBUG = False
+     parseLevel = 0
+     syllTagFlag = 0
+     LangSpecificCorrectionFlag = 1
+     writeFormat = 0
+
+ class WORDS:
+     wordCopy = ""
+     syllabifiedWord = ""
+     phonifiedWord = ""
+     unicodeWord = ""
+     syllabifiedWordOut = ""
+     outputText = ""
+
+ class STRINGS:
+     bi = 0
+     leftStr = ['' for _ in range(1100)]
+     rightStr = ['' for _ in range(1100)]
+     def refresh(self):
+         self.leftStr = ['' for _ in range(1100)]
+         self.rightStr = ['' for _ in range(1100)]
+         self.bi = 0
+
+ class GLOBALS:
+     def __init__(self):
+         self.flags = FLAGS()
+         self.words = WORDS()
+         self.combvars = STRINGS()
+
+         self.MALAYALAM = 1
+         self.TAMIL = 2
+         self.TELUGU = 3
+         self.KANNADA = 4
+         self.HINDI = 5
+         self.BENGALI = 6
+         self.GUJARATHI = 7
+         self.ODIYA = 8
+         self.PUNJABI = 9
+         self.ENGLISH = 10 # new value from 9 to 10
+
+         self.langId = 0
+         self.isSouth = False
+         self.syllableCount = 0
+
+         self.rootPath = SCRIPT_DIR+'/'
+         self.commonFile = "common.map"
+         self.outputFile = ""
+
+         self.symbolTable = [['' for _ in range(2)] for _ in range(128)]
+         self.ROW = 128
+         self.COL = 2
+         self.syllableList = []
+
+         self.VOWELSSIZE=18
+         self.CONSONANTSSIZE=34
+         self.SEMIVOWELSSIZE=4
+
+         self.VOWELS = ["a","e","i","o","u","aa","mq","aa","ii", "uu","rq","au","ee","ei","ou","oo","ax","ai"]
+         self.CONSONANTS = ["k","kh","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh","n","p","ph","b","bh","m","sh","sx","zh","s","h","lx","rx","f","dxq"]
+         self.SEMIVOWELS = ["y","w","r","l",]
+
+         # variable to indicate current language being parsed.
+         self.currLang = self.ENGLISH
+         self.answer = ''
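The language ids defined in `GLOBALS` are assigned by `SetlangId` in `helpers.py`, which compares the code point of the first visible character against a fixed set of ranges. The sketch below restates that logic as a table lookup, with the range bounds copied from `SetlangId`. `RANGES` and `detect_lang_id` are hypothetical names. One deliberate difference: this sketch returns 0 for an unmatched character, whereas the original leaves `g.currLang` at its previous value, which `GLOBALS` initialises to `ENGLISH`.

```python
# (first code point, last code point, language id), bounds as in SetlangId
RANGES = [
    (3328, 3455, 1),   # Malayalam
    (2944, 3055, 2),   # Tamil
    (3072, 3198, 3),   # Telugu
    (3202, 3311, 4),   # Kannada
    (2304, 2431, 5),   # Hindi (Devanagari)
    (2432, 2559, 6),   # Bengali
    (2688, 2815, 7),   # Gujarathi
    (2816, 2943, 8),   # Odiya
    (2560, 2687, 9),   # Punjabi (Gurmukhi)
    (64, 123, 10),     # English
]


def detect_lang_id(ch):
    """Return the language id for one character, or 0 if no range matches."""
    code_point = ord(ch)
    for first, last, lang_id in RANGES:
        if first <= code_point <= last:
            return lang_id
    return 0


print(detect_lang_id("\u0905"))  # Devanagari letter A
print(detect_lang_id("a"))
```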
Unified_parser/helpers.py ADDED
@@ -0,0 +1,1031 @@
 
+ # import sys, os
+ # SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+ # sys.path.append(SCRIPT_DIR)
+
+ from Unified_parser.globals import *
+ # contains helper functions used in parser.py
+
+ # repeated replacement of a subtring sub with tar in input until no change happens
+ def rec_replace(input : str, sub : str, tar : str):
+     while True:
+         output = input.replace(sub, tar)
+         if output == input:
+             break
+         input = output
+     return output
+
+ # function - RemoveUnwanted() - referenced in lines 63 - 109 of unified.y
+ def RemoveUnwanted(input : str) -> str:
+     # ignore punctuations
+     punctuationList = ["!",";",":","@","#","$","%","^","&","*",",",".","/","'","’","”","“","।", "]", "[", "×", "ñ", "∙","•"]
+
+     # replacing problematic unicode characters that look the same but have different encodings.
+     # punjabi update
+ replaceDict = {"ऩ":"ऩ", "ऱ":"ऱ", "क़":"क़", "ख़":"ख़", "ग़":"ग़", "ज़":"ज़", "ड़":"ड़", "ढ़":"ढ़", "ढ़":"ढ़", "फ़":"फ़", "य़":"य़", "ऴ":"ऴ",
+ "ொ":"ொ", "ோ":"ோ",
+ "ൊ":"ൊ", "ോ":"ോ", "ല്‍‌":"ൽ", "ള്‍":"ൾ", "ര്‍":"ർ", "ന്‍":"ൻ", "ണ്‍":"ൺ"}
+
+     output = ""
+     for c in input:
+         if c in punctuationList:
+             continue
+         output += c
+
+     for k in replaceDict.keys():
+         output = rec_replace(output, k, replaceDict[k])
+     return output
+
+ # function to replace GetFile in lines 132 - 156 of unified.y
+ # gives the filename according to language and type
+ def GetFile(g : GLOBALS, LangId : int, type : int) -> str:
+     fileName = g.rootPath
+
+     # return common file that contains the CPS mapping
+     if type == 0:
+         fileName += g.commonFile
+         #print("file",fileName)
+         return fileName
+
+     elif type == 1:
+         fileName += "dict/"
+
+     elif type == 2:
+         fileName += "rules/"
+
+     langIdNameMapping = { 1 : "malayalam", 2 : "tamil", 3 : "telugu",
+                           4 : "kannada", 5 : "hindi", 6 : "bengali",
+                           7 : "gujarathi", 8 : "odiya", 9 : "punjabi", 10 : "english" }
+
+     if LangId in langIdNameMapping.keys():
+         fileName += langIdNameMapping[LangId]
+
+     if type == 1:
+         fileName += ".dict"
+     elif type == 2:
+         fileName += ".rules"
+
+     return fileName
+
+ # function to replace SetlangId in lines 62-80 of unified.y
+ def SetlangId(g : GLOBALS, fl : str):
+     id = ord(fl)
+     if(id>=3328 and id<=3455):
+         g.currLang = g.MALAYALAM; #malayalam
+     elif(id>=2944 and id<=3055):
+         g.currLang = g.TAMIL; #tamil
+     elif(id>=3202 and id<=3311):
+         g.currLang = g.KANNADA; #KANNADA
+     elif(id>=3072 and id<=3198):
+         g.currLang = g.TELUGU; #telugu
+     elif(id>=2304 and id<=2431):
+         g.currLang = g.HINDI; #hindi
+     elif(id>=2432 and id<=2559):
+         g.currLang = g.BENGALI; #BENGALI
+     elif(id>=2688 and id<=2815):
+         g.currLang = g.GUJARATHI; #gujarathi
+     elif(id>=2816 and id<=2943):
+         g.currLang = g.ODIYA; #odia
+     elif(id>=2560 and id <= 2687): # punjabi
+         g.currLang = g.PUNJABI
+     elif(id>=64 and id<=123):
+         g.currLang = g.ENGLISH; #english
+
+     g.langId = g.currLang
+
+     if(g.langId < 5):
+         g.isSouth = 1
+     if(g.langId == 0):
+         print(f"UNKNOWN LANGUAGE - id = {fl}")
+         exit(0)
+     return 1
101
+
102
+ # replacement for function in lines 158 - 213. Sets the language features
103
+ def SetlanguageFeat(g : GLOBALS, input : str) -> int:
104
+
105
+ # open common file
106
+ #print("entered here")
107
+ try:
108
+ with open(GetFile(g, 0,0), 'r') as infile:
109
+ lines = infile.readlines()
110
+ #print("linessss", lines)
111
+
112
+ except OSError:
113
+ print("Couldn't open common file for reading")
114
+ return 0
115
+
116
+ str1 = input
117
+ length = len(str1)
118
+ if (length == 0):
119
+ length = 1
120
+
121
+ for j in range(0,length):
122
+ # for skipping invisible char
123
+ if (ord(str1[j]) < 8204):
124
+ firstLet = str1[j]
125
+ break
126
+
127
+ SetlangId(g, firstLet) # set global langId
128
+ for i in range(len(lines)):
129
+ l = lines[i].strip().split('\t')
130
+ g.symbolTable[i][1] = l[1]
131
+ g.symbolTable[i][0] = l[1 + g.langId]
132
+
133
+ return 1
134
+
135
+ # replacement for function in lines 52 - 59. Check if symbol is in symbolTable
136
+ def CheckSymbol(g : GLOBALS, input : str) -> int:
137
+ i = 0
138
+ for i in range(g.ROW):
139
+ if (g.symbolTable[i][1] == input):
140
+ return 1
141
+ return 0
142
+
143
+ # replacement for function in lines 249 - 276. Convert utf-8 to cps symbols
144
+ def ConvertToSymbols(g : GLOBALS, input : str) -> str:
145
+ str1 = input
146
+
147
+ g.words.syllabifiedWord = "&"
148
+ for j in range(len(str1)):
149
+ if (ord(str1[j]) < 8204):
150
+ g.words.syllabifiedWord += "&" + g.symbolTable[ord(str1[j])%128][1]
151
+
152
+ g.words.syllabifiedWord = g.words.syllabifiedWord[1:]
153
+ return g.words.syllabifiedWord
154
+
155
+ # function in lines 1278 - 1299. save answer in g.answer
156
+ def WriteFile(g : GLOBALS, text : str):
157
+ g.answer = f"(set! wordstruct '( {text}))"
158
+
159
+ # function in lines 588-597. check if vowel is in input. 'q' special case, 'rq' special case
160
+ def CheckVowel(input : str, q : int, rq : int) -> int:
161
+ if (input.find("a") != -1):
162
+ return 1
163
+ if (input.find("e") != -1):
164
+ return 1
165
+ if (input.find("i") != -1):
166
+ return 1
167
+ if (input.find("o") != -1):
168
+ return 1
169
+ if (input.find("u") != -1):
170
+ return 1
171
+ if (q and input.find("q") != -1):
172
+ return 1
173
+ if (rq and input.find("rq") != -1):
174
+ return 1
175
+ return 0
176
+
177
+ # function in lines 599-602.
178
+ def Checkeuv(input : str) -> int:
179
+ if (input.find("euv") != -1):
180
+ return 1
181
+ return 0
182
+
183
+ # function in lines 605-613
184
+ def CheckSingleVowel(input : str, q : int) -> int:
185
+ if (input in ['a', 'e', 'i', 'o', 'u']):
186
+ return 1
187
+ if (q != 0 and input == 'q'):
188
+ return 1
189
+ return 0
190
+
191
+ # function in lines 616 - 629. get the type of phone in the position
192
+ def GetPhoneType(g : GLOBALS, input : str, pos : int) -> int:
193
+ phone = input
194
+ phone = phone.split('&')
195
+ phone = list(filter(lambda x : x != '', phone))
196
+ pos = min(pos, len(phone))
197
+ pch = phone[pos - 1]
198
+
199
+ if (g.flags.DEBUG):
200
+ print(f'input : {input}')
201
+ print(f"str : {pch} {GetType(g, pch)}")
202
+
203
+ return GetType(g, pch)
204
+
205
+ # function in lines 631 - 637. get the type of given input
206
+ def GetType(g : GLOBALS, input : str):
207
+ for i in range(g.VOWELSSIZE):
208
+ if g.VOWELS[i] == input:
209
+ return 1
210
+ for i in range(g.CONSONANTSSIZE):
211
+ if g.CONSONANTS[i] == input:
212
+ return 2
213
+ for i in range(g.SEMIVOWELSSIZE):
214
+ if g.SEMIVOWELS[i] == input:
215
+ return 3
216
+ return 0
217
+
218
+ # function in lines 640 - 647. check if chillaksharas are there --for malayalam
219
+ def CheckChillu(input : str) -> int:
220
+ l = ["nwv", "nnv", "rwv", "lwv", "lnv"]
221
+ for x in l:
222
+ if (input.find(x) != -1):
223
+ return 1
224
+
225
+ return 0
226
+
227
+ # function in lines 650 - 660. get UTF-8 from CPS
228
+ def GetUTF(g : GLOBALS, input : str) -> str :
229
+ for i in range(g.ROW):
230
+ if (input == g.symbolTable[i][1]):
231
+ return g.symbolTable[i][0]
232
+
233
+ return 0
234
+
235
+ # function in lines 663 - 666. verify the letter is english char -- CLS
236
+ def isEngLetter(p : str) -> int:
237
+ if (ord(p) >= 97 and ord(p) <= 122):
238
+ return 1
239
+ return 0
240
+
241
+ # function in lines 669-682. remove unwanted Symbols from word
242
+ def CleanseWord(phone : str) -> str:
243
+ phonecopy = ""
244
+ for c in phone:
245
+ if (c != '&' and isEngLetter(c) == 0):
246
+ c = '#'
247
+ phonecopy += c
248
+ phonecopy = rec_replace(phonecopy, '$','')
249
+ phonecopy = rec_replace(phonecopy, '&&','&')
250
+ return phonecopy
251
+
252
+ # replacement for function in lines 321 - 356. Correct if there is a vowel in the middle
253
+ def MiddleVowel(g : GLOBALS, phone : str) -> str:
254
+
255
+ c1 = ''
256
+ c2 = ''
257
+ phonecopy = phone
258
+ for i in range(g.CONSONANTSSIZE):
259
+ for j in range(g.VOWELSSIZE):
260
+ c1 = f'&{g.CONSONANTS[i]}&{g.VOWELS[j]}&'
261
+ c2 = f'&{g.CONSONANTS[i]}&av&{g.VOWELS[j]}&'
262
+
263
+ phonecopy = phonecopy.replace(c1, c2)
264
+
265
+ for i in range(g.SEMIVOWELSSIZE):
266
+ for j in range(g.VOWELSSIZE):
267
+ c1 = f'&{g.SEMIVOWELS[i]}&{g.VOWELS[j]}&'
268
+ c2 = f'&{g.SEMIVOWELS[i]}&av&{g.VOWELS[j]}&'
269
+
270
+ phonecopy = phonecopy.replace(c1, c2)
271
+
272
+ return phonecopy
273
+
274
+ # replacement for function in lines 435 - 459. //can't use this as it breaks syllable rules.
275
+ # NOT USED ANYWHERE
276
+ def DoubleModifierCorrection(phone : str) -> str:
277
+
278
+ doubleModifierList = ["&nwv&","&nnv&","&rwv&","&lwv&","&lnv&","&aav&","&iiv&","&uuv&","&rqv&","&eev&",
279
+ "&eiv&","&ouv&","&axv&","&oov&","&aiv&","&auv&","&aev&",
280
+ "&iv&","&ov&","&ev&","&uv&"]
281
+
282
+ phonecopy = phone
283
+ for i in range(0,21):
284
+ for j in range(0,21):
285
+ c1 = f'{doubleModifierList[i]}#{doubleModifierList[j]}'
286
+ c2 = f'{doubleModifierList[i]}{doubleModifierList[j]}#&'
287
+ phonecopy = phonecopy.replace(c1, c2)
288
+
289
+ phonecopy = rec_replace(phonecopy, "&#&hq&","&hq&#&")
290
+ phonecopy = rec_replace(phonecopy, "&&","&")
291
+ return phonecopy
292
+
293
+ # replacement for function in lines 462 - 495. //for eu&C&C&V
294
+ def SchwaDoubleConsonent(phone : str) -> str:
295
+ consonentList = ["k","kh","lx","rx","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh","n","p","ph","b","bh","m","y","r","l","w","sh","sx","zh","y","s","h","f","dxq"]
296
+ vowelList = ["av&","nwv&","nnv&","rwv&","lwv&","lnv&","aav&","iiv&","uuv&","rqv&","eev&","eiv&","ouv&",
297
+ "axv&","oov&","aiv&","nnx&","nxx&","rrx&","llx&","lxx&",
298
+ "aa&","iv&","ov&","mq&","aa&","ii&","uu&","rq&",
299
+ "ee&","ei&","ou&","oo&","ax&","ai&","ev&","uv&",
300
+ "a&","e&","i&","o&","u&"]
301
+
302
+ phonecopy = phone
303
+ for i in range(0,39):
304
+ for j in range(0,39):
305
+ for k in range(0,42):
306
+ c1 = f'&euv&{consonentList[i]}&{consonentList[j]}&{vowelList[k]}'
307
+ c2 = f'&euv&{consonentList[i]}&av&{consonentList[j]}&{vowelList[k]}'
308
+ phonecopy = phonecopy.replace(c1, c2)
309
+ phonecopy = rec_replace(phonecopy, "$","")
310
+ return phonecopy
311
+
312
+ # replacement for function in lines 498 - 585. //halant specific correction for aryan langs
313
+ def SchwaSpecificCorrection(g : GLOBALS, phone : str) -> str:
314
+ schwaList = ["k","kh","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh",
315
+ "nx","t","th","d","dh","n","p","ph","b","bh","m","y",
316
+ "r","l","s","w","sh","sx","zh","h","lx","rx","f","dxq"]
317
+
318
+ vowelList = ["av&","nwv&","nnv&","rwv&","lwv&","lnv&","aav&","iiv&","uuv&","rqv&","eev&","eiv&","ouv&",
319
+ "axv&","oov&","aiv&","nnx&","nxx&","rrx&","llx&","lxx&",
320
+ "aa&","iv&","ov&","mq&","aa&","ii&","uu&","rq&",
321
+ "ee&","ei&","ou&","oo&","ax&","ai&","ev&","uv&",
322
+ "a&","e&","i&","o&","u&"]
323
+
324
+ if (g.flags.DEBUG):
325
+ print(f'{len(phone)}')
326
+
327
+ phonecopy = phone + '!'
328
+
329
+ if (g.flags.DEBUG):
330
+ print(f'phone cur - {phonecopy}')
331
+
332
+ # // for end correction &av&t&aav&. //dont want av
333
+ for i in range(0,38):
334
+ for j in range(1,42):
335
+ c1 = f'&av&{schwaList[i]}&{vowelList[j]}!'
336
+ c2 = f'&euv&{schwaList[i]}&{vowelList[j]}!'
337
+ phonecopy = phonecopy.replace(c1, c2)
338
+
339
+ phonecopy = rec_replace(phonecopy, '!', '')
340
+
341
+ for i in range(0,38):
342
+ c1 = f'&av&{schwaList[i]}&av&'
343
+ c2 = f'&euv$&{schwaList[i]}&av$&'
344
+ phonecopy = phonecopy.replace(c1, c2)
345
+
346
+ if(g.flags.DEBUG):
347
+ print(f"inside schwa {phonecopy}")
348
+
349
+ for i in range(0,38):
350
+ c1 = f'&av&{schwaList[i]}&'
351
+ c3 = f'&{schwaList[i]}&'
352
+
353
+ for j in range(0,41):
354
+ c4 = f'&euv&{c3}${vowelList[j]}'
355
+ c2 = f'{c1}{vowelList[j]}'
356
+ phonecopy = phonecopy.replace(c2, c4)
357
+
358
+ phonecopy = rec_replace(phonecopy, '$', '')
359
+
360
+ #//&q&w&eu& - CORRECTED TO 38 - CHECK
361
+ for i in range(0,38):
362
+ c1 = f'&q&{schwaList[i]}&euv&'
363
+ c2 = f'&q&{schwaList[i]}&av&'
364
+ phonecopy = phonecopy.replace(c1, c2)
365
+
366
+ return phonecopy
367
+
368
+ # replacement for function in lines . //correct the geminate syllabification ,isReverse --reverse correction
369
+ def GeminateCorrection(phone : str, isReverse : int) -> str:
370
+ geminateList = ["k","kh","lx","rx","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh","n","p","ph","b","bh","m","y",
371
+ "r","l","w","sh","sx","zh","y","s","h","f","dxq"]
372
+
373
+ phonecopy = phone
374
+ for i in range(0, 39):
375
+ c1 = f'&{geminateList[i]}&eu&{geminateList[i]}&'
376
+ c2 = f'&{geminateList[i]}&{geminateList[i]}&'
377
+ phonecopy = rec_replace(phonecopy, c2, c1) if isReverse != 0 else rec_replace(phonecopy, c1, c2)
378
+
379
+ return phonecopy
380
+
381
+ # replacement for function in lines 356 - 430. //Syllabify the words
382
+ def Syllabilfy(phone : str) -> str:
383
+
384
+ phonecopy = phone
385
+ phonecopy = rec_replace(phonecopy, "&&","&")
386
+ phonecopy = phonecopy.replace("&eu&","&eu&#&")
387
+ phonecopy = phonecopy.replace("&euv&","&euv&#&")
388
+ phonecopy = rec_replace(phonecopy, "&avq","&q&av")
389
+ phonecopy = phonecopy.replace("&av&","&av&#&")
390
+ phonecopy = phonecopy.replace("&q","&q&#")
391
+
392
+ removeList = ["&nwv&","&nnv&","&rwv&","&lwv&","&lnv&","&aav&","&iiv&","&uuv&","&rqv&","&eev&",
393
+ "&eiv&","&ouv&","&axv&","&oov&","&aiv&","&auv&","&aev&",
394
+ "&nnx&","&nxx&","&rrx&","&llx&","&lxx&",
395
+ "&aa&","&iv&","&ov&","&mq&","&aa&","&ii&","&uu&","&rq&","&au&","&ee&",
396
+ "&ei&","&ou&","&oo&","&ax&","&ai&","&ev&","&uv&","&ae&",
397
+ "&a&","&e&","&i&","&o&","&u&"]
398
+
399
+ for i in range(0,45):
400
+ c1 = removeList[i]
401
+ c2 = c1 + '#&'
402
+ phonecopy = phonecopy.replace(c1, c2)
403
+ phonecopy = rec_replace(phonecopy, "&#&hq&","&hq&#&")
404
+
405
+ # //for vowel in between correction
406
+ pureVowelList = ["&a&","&e&","&i&","&o&","&u&"]
407
+ for i in range(0,5):
408
+ c1 = f'&#{pureVowelList[i]}'
409
+ phonecopy = phonecopy.replace(pureVowelList[i], c1)
410
+
411
+ consonantList = ["k","kh","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh",
412
+ "nx","t","th","d","dh","n","p","ph","b","bh","m","y",
413
+ "r","l","w","sh","sx","zh","y","s","h","lx","rx","f","dxq"]
414
+
415
+ # // &eu&#&r&eu&#& syllabification correction
416
+
417
+ for i in range(0,39):
418
+ c1 = f'&eu&#&{consonantList[i]}&euv&#&'
419
+ c2 = f'&eu&{consonantList[i]}&av&#&'
420
+ phonecopy = phonecopy.replace(c1, c2)
421
+
422
+ for i in range(0,39):
423
+ c1 = f'&euv&#&{consonantList[i]}&euv&#&'
424
+ c2 = f'&euv&{consonantList[i]}&av&#&'
425
+ phonecopy = phonecopy.replace(c1, c2)
426
+
427
+ phonecopy = phonecopy.replace("&eu&","&eu&#&")
428
+ return phonecopy
429
+
430
+ # replacement for function in lines 279 - 317. //check the word in Dict.
431
+ # REMOVED EXIT(1) ON ENGLISH. WAS USELESS
432
+ def CheckDictionary(g : GLOBALS, input : str) -> int:
433
+
434
+ fileName = GetFile(g, g.langId, 1)
435
+ if (g.flags.DEBUG):
436
+ print(f'dict : {fileName}')
437
+ try:
438
+ with open(fileName, 'r') as output:
439
+ cnts = output.readlines()
440
+ except OSError:
441
+ if g.flags.DEBUG:
442
+ print(f'Dict not found')
443
+ if(g.langId == g.ENGLISH):
444
+ exit(1)
445
+ return 0
446
+
447
+ if (g.langId == g.ENGLISH):
448
+ input1 = ''
449
+ for c in input:
450
+ if ord(c) < 97:
451
+ c = c.lower()
452
+ input1 += c
453
+ input = input1
454
+
455
+ for l in cnts:
456
+ l = l.strip().split('\t')
457
+ assert(len(l) == 3)
458
+ if g.flags.DEBUG:
459
+ print(f"word : {l[0]}")
460
+ if input == l[0]:
461
+ if g.flags.DEBUG:
462
+ print(f"match found")
463
+ print(f'Syllables : {l[1]}')
464
+ print(f'monophones : {l[2]}')
465
+ if g.flags.writeFormat == 1:
466
+ WriteFile(g, l[1])
467
+ if g.flags.writeFormat == 0:
468
+ WriteFile(g, l[2])
469
+ return 1
470
+
471
+ return 0
472
+
473
+ # replacement for function in lines 801-821.
474
+ def PositionCorrection(phone : str, left : str, right :str, isReverse:int) -> str:
475
+ geminateList = ["k","kh","lx","rx","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh",
476
+ "n","p","ph","b","bh","m","y","r","l","w","sh","sx","zh","y","s","h","f","dxq"]
477
+ phonecopy = phone
478
+ for i in range(0,39):
479
+ c1 = left
480
+ c2 = right
481
+ c1 = c1.replace('@', geminateList[i])
482
+ c2 = c2.replace('@', geminateList[i])
483
+ phonecopy = rec_replace(phonecopy, c2, c1) if isReverse != 0 else rec_replace(phonecopy, c1, c2)
484
+ return phonecopy
485
+
486
+ # replacement for function in lines 711 - 713.
487
+ def CountChars(s : str, c : str) -> int:
488
+ count = 0
489
+ for x in s:
490
+ if x == c:
491
+ count += 1
492
+ return count
493
+
494
+ # replacement for function in lines 719 - 744.
495
+ def GenerateAllCombinations(g : GLOBALS, j : int, s : str, c : list, isRight : int):
496
+ t = ''
497
+ if (c[j][0][0] == '#'):
498
+ if isRight == 1:
499
+ g.combvars.rightStr[g.combvars.bi] = s + '&'
500
+ g.combvars.bi += 1
501
+ else:
502
+ g.combvars.leftStr[g.combvars.bi] = s + '&'
503
+ g.combvars.bi += 1
504
+ else:
505
+ i = 0
506
+ while (c[j][i][0] != '#'):
507
+ t = s + '&' + c[j][i]
508
+ GenerateAllCombinations(g, j+1, t, c, isRight)
509
+ i += 1
510
+
511
+ # replacement for function in lines 746 - 768.
512
+ def GenerateMatrix(g : GLOBALS, combMatrix : list, regex : str):
513
+ row, col, item = 0, 0, 0
514
+ for i in range(0, len(regex)):
515
+ if regex[i] == '&':
516
+ combMatrix[row][col+1] = '#'
517
+ row += 1
518
+ col = 0
519
+ item = 0
520
+ combMatrix[row][col] = ''
521
+ elif regex[i] == '|':
522
+ col += 1
523
+ item = 0
524
+ combMatrix[row][col] = ''
525
+ else:
526
+ combMatrix[row][col] = combMatrix[row][col][:item] + regex[i] + combMatrix[row][col][(item+1):]
527
+ item += 1
528
+ if g.flags.DEBUG:
529
+ print(f'{row} {col} {combMatrix[row][col]}')
530
+
531
+ combMatrix[row][col+1] = '#'
532
+ combMatrix[row+1][0] = '#'
533
+
534
+ # replacement for function in lines 770 - 799.
535
+ def CombinationCorrection(g : GLOBALS, phone : str, left : str, right : str, isReverse : int) -> str:
536
+ leftComb = [['' for _ in range(256)] for _ in range(256)]
537
+ rightComb = [['' for _ in range(256)] for _ in range(256)]
538
+ GenerateMatrix(g, leftComb, left)
539
+ GenerateMatrix(g, rightComb, right)
540
+
541
+ g.combvars.bi = 0
542
+ GenerateAllCombinations(g, 0, '', leftComb, 0)
543
+ g.combvars.bi = 0
544
+ GenerateAllCombinations(g, 0, '', rightComb, 1)
545
+
546
+ i = 0
547
+ phonecopy = phone
548
+ while g.combvars.leftStr[i] != '':
549
+ if isReverse != 0:
550
+ phonecopy = phonecopy.replace(g.combvars.rightStr[i], g.combvars.leftStr[i])
551
+ else:
552
+ phonecopy = phonecopy.replace(g.combvars.leftStr[i], g.combvars.rightStr[i])
553
+
554
+ if g.flags.DEBUG:
555
+ print(f'{g.combvars.leftStr[i]} {g.combvars.rightStr[i]}')
556
+
557
+ i += 1
558
+
559
+ g.combvars.refresh()
560
+ return phonecopy
561
+
562
+ # replacement for function in lines 825 - 930. //Language specific corrections
563
+ def LangSpecificCorrection(g : GLOBALS, phone : str, langSpecFlag : int) -> str:
564
+ phonecopy = phone
565
+ if g.isSouth:
566
+ phonecopy = rec_replace(phonecopy,"&ei&","&ai&")
567
+ phonecopy = rec_replace(phonecopy,"&eiv&","&aiv&")
568
+ else:
569
+ phonecopy = rec_replace(phonecopy,"&oo&","&o&")
570
+ phonecopy = rec_replace(phonecopy,"&oov&","&ov&")
571
+
572
+ phonecopy = phonecopy.replace("&q&","&av&q&")
573
+ phonecopy = rec_replace(phonecopy, "&a&av&","&a&")
574
+ phonecopy = rec_replace(phonecopy, "&e&av&","&e&")
575
+ phonecopy = rec_replace(phonecopy, "&i&av&","&i&")
576
+ phonecopy = rec_replace(phonecopy, "&o&av&","&o&")
577
+ phonecopy = rec_replace(phonecopy, "&u&av&","&u&")
578
+ phonecopy = rec_replace(phonecopy,"&a&rqv&","&rq&")
579
+ phonecopy = rec_replace(phonecopy,"&aa&av&","&aa&")
580
+ phonecopy = rec_replace(phonecopy,"&ae&av&","&ae&")
581
+ phonecopy = rec_replace(phonecopy,"&ax&av&","&ax&")
582
+ phonecopy = rec_replace(phonecopy,"&ee&av&","&ee&")
583
+ phonecopy = rec_replace(phonecopy,"&ii&av&","&ii&")
584
+ phonecopy = rec_replace(phonecopy,"&ai&av&","&ai&")
585
+ phonecopy = rec_replace(phonecopy,"&au&av&","&au&")
586
+ phonecopy = rec_replace(phonecopy,"&oo&av&","&oo&")
587
+ phonecopy = rec_replace(phonecopy,"&uu&av&","&uu&")
588
+ phonecopy = rec_replace(phonecopy,"&rq&av&","&rq&")
589
+ phonecopy = rec_replace(phonecopy,"&av&av&","&av&")
590
+ phonecopy = rec_replace(phonecopy,"&ev&av&","&ev&")
591
+ phonecopy = rec_replace(phonecopy,"&iv&av&","&iv&")
592
+ phonecopy = rec_replace(phonecopy,"&ov&av&","&ov&")
593
+ phonecopy = rec_replace(phonecopy,"&uv&av&","&uv&")
594
+
595
+ phonecopy = rec_replace(phonecopy, "&av&rqv&","&rqv&")
596
+ phonecopy = rec_replace(phonecopy, "&aav&av&","&aav&")
597
+ phonecopy = rec_replace(phonecopy, "&aev&av&","&aev&")
598
+ phonecopy = rec_replace(phonecopy, "&auv&av&","&auv&")
599
+ phonecopy = rec_replace(phonecopy, "&axv&av&","&axv&")
600
+ phonecopy = rec_replace(phonecopy, "&aiv&av&","&aiv&")
601
+ phonecopy = rec_replace(phonecopy, "&eev&av&","&eev&")
602
+ phonecopy = rec_replace(phonecopy, "&eiv&av&","&eiv&")
603
+ phonecopy = rec_replace(phonecopy, "&iiv&av&","&iiv&")
604
+ phonecopy = rec_replace(phonecopy, "&oov&av&","&oov&")
605
+ phonecopy = rec_replace(phonecopy, "&ouv&av&","&ouv&")
606
+ phonecopy = rec_replace(phonecopy, "&uuv&av&","&uuv&")
607
+ phonecopy = rec_replace(phonecopy, "&rqv&av&","&rqv&")
608
+
609
+ if langSpecFlag == 0:
610
+ return phonecopy
611
+
612
+ fileName = GetFile(g, g.langId, 2)
613
+ with open(fileName, 'r') as output:
614
+ cnts = output.readlines()
615
+
616
+ left = ''
617
+ right = ''
618
+ phonecopy = '^' + phonecopy + '$'
619
+
620
+ if (g.flags.DEBUG):
621
+ print(f'phone : {phonecopy}')
622
+
623
+ for l in cnts:
624
+ l = l.strip()
625
+ if (l.find('#') != -1):
626
+ continue
627
+
628
+ l = l.split('\t')
629
+ assert(len(l) == 2)
630
+ left, right = l[0], l[1]
631
+
632
+ if left.find('|') != -1:
633
+ a1 = left[1:-1]
634
+ a2 = right[1:-1]
635
+ phonecopy = CombinationCorrection(g, phonecopy, a1, a2, 0)
636
+ if g.flags.DEBUG:
637
+ print(f'{a1}\t{a2}')
638
+ elif left.find('@') != -1:
639
+ phonecopy = PositionCorrection(phonecopy, left, right, 0)
640
+ else:
641
+ phonecopy = phonecopy.replace(left, right)
642
+
643
+ # //remove head and tail in phone
644
+ phonecopy = phonecopy.replace('^', '')
645
+ phonecopy = phonecopy.replace('$', '')
646
+ # //end correction
647
+ count = 0
648
+ for i in range(len(phonecopy)):
649
+ if phonecopy[i] == '&':
650
+ count = i
651
+ return phonecopy[:(count+1)]
652
+
653
+ # Replacement for function in lines 934 - 991. //Reverse syllable correction for syllable parsing
654
+ def SyllableReverseCorrection(g : GLOBALS, phone : str, langSpecFlag : int) -> str:
655
+ phonecopy = phone
656
+
657
+ if g.isSouth:
658
+ phonecopy = rec_replace(phonecopy, "&ai&","&ei&")
659
+ phonecopy = rec_replace(phonecopy, "&aiv&","&eiv&")
660
+ else:
661
+ phonecopy = rec_replace(phonecopy, "&o&","&oo&")
662
+ phonecopy = rec_replace(phonecopy, "&ov&","&oov&")
663
+
664
+ if langSpecFlag == 0:
665
+ return phonecopy
666
+
667
+ fileName = GetFile(g, g.langId, 2)
668
+ with open(fileName, 'r') as output:
669
+ cnts = output.readlines()
670
+
671
+ left = ''
672
+ right = ''
673
+ # //update head and tail in phone
674
+ phonecopy = '^' + phonecopy + '$'
675
+
676
+ if g.flags.DEBUG:
677
+ print(f'before phone : {phonecopy}')
678
+
679
+ for l in cnts:
680
+ l = l.strip()
681
+ if (l.find('#') != -1):
682
+ continue
683
+
684
+ l = l.split('\t')
685
+ assert(len(l) == 2)
686
+ left, right = l[0], l[1]
687
+
688
+ if left.find('|') != -1:
689
+ a1 = left[1:-1]
690
+ a2 = right[1:-1]
691
+ phonecopy = CombinationCorrection(g, phonecopy, a1, a2, 1)
692
+ if g.flags.DEBUG:
693
+ print(f'{a1}\t{a2}')
694
+ elif left.find('@') != -1:
695
+ phonecopy = PositionCorrection(phonecopy, left, right, 1)
696
+ else:
697
+ phonecopy = phonecopy.replace(right, left)
698
+
699
+ # //remove head and tail in phone
700
+ phonecopy = phonecopy.replace('^', '')
701
+ phonecopy = phonecopy.replace('$', '')
702
+ # //end correction
703
+ if (g.flags.DEBUG):
704
+ print(f'after phone : {phonecopy}')
705
+ return phonecopy
706
+
707
+ # //language specific syllable correction
708
+ def LangSyllableCorrection(input : str) -> int:
709
+ if input == "&av&q&":
710
+ return 1
711
+ else:
712
+ return 0
713
+
714
+ # replacement for function in lines 1000 - 1160. //split into syllable array
715
+ def SplitSyllables(g : GLOBALS, input : str) -> int:
716
+ incopy = input
717
+
718
+ if g.flags.writeFormat == 2:
719
+ i = 0
720
+ j = 0
721
+ fullList = ["k","kh","lx","rx","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh","n","p","ph","b","bh","m","y","r","l","w","sh","sx","zh","y","s","h","f","dxq"]
722
+
723
+ for i in range(0,39):
724
+ for j in range(0,39):
725
+ c1 = f'&{fullList[i]}&{fullList[j]}&'
726
+ c2 = f'&{fullList[i]}&euv&#&{fullList[j]}&'
727
+ incopy = incopy.replace(c1, c2)
728
+
729
+ incopy = rec_replace(incopy, "&#&mq&","&mq&")
730
+ incopy = rec_replace(incopy, "&#&q&","&q&")
731
+
732
+ pch = incopy.split('#')
733
+ g.syllableList = []
734
+ for c in pch:
735
+ if c != '&':
736
+ g.syllableList.append(c)
737
+
738
+ # ln -> len
739
+ ln = len(g.syllableList)
740
+ if (ln == 0):
741
+ return 1
742
+
743
+ if g.flags.DEBUG:
744
+ for i in range(ln):
745
+ print(f"initStack : {g.syllableList[i]}")
746
+
747
+ # //south specific av addition
748
+ if CheckVowel(g.syllableList[ln-1],1,0) == 0 and CheckChillu(g.syllableList[ln-1]) == 0:
749
+ if g.isSouth:
750
+ g.syllableList[ln-1] += '&av&'
751
+ else:
752
+ g.syllableList[ln-1] += '&euv&'
753
+
754
+ # //round 2 correction
755
+ if g.flags.writeFormat == 2:
756
+ g.syllableCount = ln
757
+ g.flags.writeFormat = 1
758
+ return 1
759
+
760
+ euFlag = 1
761
+ if ln > 1:
762
+ for i in range(ln-1,-1,-1):
763
+ if LangSyllableCorrection(g.syllableList[i]) == 1:
764
+ g.syllableList[i-1] += g.syllableList[i]
765
+ g.syllableList[i] = ''
766
+
767
+ if g.syllableList[i].find("&eu&") != -1:
768
+ g.syllableList[i] = g.syllableList[i].replace("&eu&", "!")
769
+ euFlag = 1
770
+
771
+ if g.syllableList[i].find("&euv&") != -1:
772
+ g.syllableList[i] = g.syllableList[i].replace("&euv&", "!")
773
+ euFlag = 2
774
+
775
+ if CheckVowel(g.syllableList[i],0,1) == 0:
776
+ if i-1 >= 0:
777
+ g.syllableList[i-1] += g.syllableList[i]
778
+ g.syllableList[i] = ''
779
+ else:
780
+ g.syllableList[i] += g.syllableList[i+1]
781
+ g.syllableList[i+1] = ''
782
+
783
+ if i-1 > 0:
784
+ if euFlag == 1:
785
+ g.syllableList[i-1] = g.syllableList[i-1].replace("!","&eu&")
786
+ elif euFlag == 2:
787
+ g.syllableList[i-1] = g.syllableList[i-1].replace("!","&euv&")
788
+ g.syllableList[i-1] = rec_replace(g.syllableList[i-1], "&&","&")
789
+
790
+ if euFlag == 1:
791
+ g.syllableList[i] = g.syllableList[i].replace("!","&eu&")
792
+ elif euFlag == 2:
793
+ g.syllableList[i] = g.syllableList[i].replace("!","&euv&")
794
+ else:
795
+ if (CheckVowel(g.syllableList[0],1,0) == 0 and g.flags.writeFormat != 3) or Checkeuv(g.syllableList[0]) != 0:
796
+ g.syllableList[0] += '&av'
797
+
798
+ if g.flags.DEBUG:
799
+ for i in range(ln):
800
+ print(f'syllablifiedStack : {g.syllableList[i]}')
801
+
802
+ # //round 3 double syllable correction
803
+ for i in range(ln):
804
+ # //corrections
805
+ g.syllableList[i] = g.syllableList[i].replace('1','')
806
+ if g.flags.DEBUG:
807
+ print(f'LenStack : {len(g.syllableList[i])}')
808
+
809
+ if len(g.syllableList[i]) > 0:
810
+ if g.syllableList[i].find("&eu&") != -1:
811
+ g.syllableList[i] = g.syllableList[i].replace("&eu&", "!")
812
+ euFlag = 1
813
+
814
+ if g.syllableList[i].find("&euv&") != -1:
815
+ g.syllableList[i] = g.syllableList[i].replace("&euv&", "!")
816
+ euFlag = 2
817
+
818
+ if CheckVowel(g.syllableList[i],0,1) == 0 and g.flags.writeFormat != 3:
819
+ if g.flags.DEBUG:
820
+ print(f'Stack : {g.syllableList[i]}')
821
+ g.syllableList[i] += '&av'
822
+
823
+ if g.syllableList[i].find('!') != -1:
824
+ if euFlag == 1:
825
+ g.syllableList[i] = g.syllableList[i].replace("!","&eu&")
826
+ elif euFlag == 2:
827
+ g.syllableList[i] = g.syllableList[i].replace("!","&euv&")
828
+ g.syllableList[i] = g.syllableList[i].replace('!', 'eu')
829
+
830
+ g.syllableList[i] = rec_replace(g.syllableList[i], '&&', '&')
831
+ g.syllableList[i] = GeminateCorrection(g.syllableList[i],1)
832
+
833
+ if g.flags.DEBUG:
834
+ for i in range(ln):
835
+ print(f'syllablifiedStack1 : {g.syllableList[i]}')
836
+ print(f'No of syllables : {ln}')
837
+
838
+ g.syllableCount = ln
839
+ if g.flags.writeFormat == 3:
840
+ g.flags.writeFormat = 0
841
+ return 1
842
+
843
+ # replacement for function in lines 1164 - 1275. //make to write format
844
+ def WritetoFiles(g : GLOBALS) -> int:
845
+ if g.flags.DEBUG:
846
+ for i in range(0,g.syllableCount):
847
+ print(f'syllablifiedStackfinal : {g.syllableList[i]}')
848
+
849
+ validSyllable = 0
850
+ for i in range(0,g.syllableCount):
851
+ if g.syllableList[i] != '':
852
+ validSyllable += 1
853
+
854
+ if g.flags.DEBUG:
855
+ print(f'a correction {g.syllableList[0]}')
856
+
857
+ g.words.outputText = ''
858
+
859
+ # //phone
860
+ j = 0
861
+ if g.flags.writeFormat == 0:
862
+ syllablesPrint = 0
863
+ for i in range(g.syllableCount):
864
+ g.words.outputText += '(( '
865
+ l = g.syllableList[i].split('&')
866
+ for pch in l:
867
+ if pch == '':
868
+ continue
869
+ if g.flags.DEBUG:
870
+ print(f'syl {pch}')
871
+ j = 1
872
+ g.words.outputText += f'"{pch}" '
873
+ if j != 0:
874
+ if g.flags.syllTagFlag != 0:
875
+ if syllablesPrint == 0:
876
+ g.words.outputText += '_beg'
877
+ elif syllablesPrint == validSyllable - 1:
878
+ g.words.outputText += '_end'
879
+ else:
880
+ g.words.outputText += '_mid'
881
+ syllablesPrint += 1
882
+ g.words.outputText += ') 0) '
883
+ else:
884
+ g.words.outputText = g.words.outputText[:(len(g.words.outputText) - 3)]
885
+ j = 0
886
+
887
+ g.words.outputText = g.words.outputText.replace('v', '')
888
+ g.words.outputText = g.words.outputText.replace(" \"eu\"","")
889
+ g.words.outputText = g.words.outputText.replace('!', '')
890
+
891
+ # //syllable
892
+ elif g.flags.writeFormat == 1:
893
+ syllablesPrint = 0
894
+ for i in range(g.syllableCount):
895
+ g.syllableList[i] = rec_replace(g.syllableList[i], 'euv', 'eu')
896
+ g.syllableList[i] = SyllableReverseCorrection(g, g.syllableList[i], g.flags.LangSpecificCorrectionFlag)
897
+ if g.flags.DEBUG:
898
+ print(f'{g.syllableList[i]}')
899
+ g.words.outputText += '(( "'
900
+ l = g.syllableList[i].split('&')
901
+ for pch in l:
902
+ if pch == '':
903
+ continue
904
+ if g.flags.DEBUG:
905
+ print(f'syl {pch}')
906
+ j = 1
907
+ if CheckSymbol(g, pch) != 0:
908
+ g.words.outputText += GetUTF(g, pch)
909
+ if pch == 'av' and g.flags.DEBUG:
910
+ print('av found')
911
+ if j != 0:
912
+ if g.flags.syllTagFlag != 0:
913
+ if syllablesPrint == 0:
914
+ g.words.outputText += '_beg'
915
+ elif syllablesPrint == validSyllable - 1:
916
+ g.words.outputText += '_end'
917
+ else:
918
+ g.words.outputText += '_mid'
919
+ syllablesPrint += 1
920
+ g.words.outputText += '" ) 0) '
921
+ else:
922
+ g.words.outputText = g.words.outputText[:(len(g.words.outputText) - 4)]
923
+ j = 0
924
+
925
+ g.words.outputText = g.words.outputText.replace('#', '')
926
+ g.words.outputText = g.words.outputText.replace('  ', ' ')
927
+ if g.flags.DEBUG:
928
+ print(f'Print text : {g.words.outputText}')
929
+
930
+ WriteFile(g, g.words.outputText)
931
+ return 1
932
+
933
+
934
+ def load_mapping_file(g: GLOBALS):
935
+ # open common file
936
+ try:
937
+ # print('1.entered')
938
+ with open("/speech/utkarsh/tts_api/Unified_parser/common_hindi.map", 'r') as infile: # NOTE: hard-coded absolute path; g.rootPath and g.commonFile are not used here
939
+ lines = infile.readlines()
940
+ # print(lines)
941
+ except OSError:
942
+ print("Couldn't open common file for reading")
943
+ return 0
944
+
945
+ table=[]
946
+ for i in range(len(lines)):
947
+ l = lines[i].strip().split('\t')
948
+ table.append(l)
949
+
950
+ # g.symbolTable[i][1] = l[1]
951
+ # g.symbolTable[i][0] = l[1 + g.langId]
952
+
953
+ return table
954
+
955
+ def set_lang_id(language):
956
+ if language == "malayalam":
957
+ lang_id=1
958
+ elif language == "tamil":
959
+ lang_id=2
960
+ elif language == "telugu":
961
+ lang_id=3
962
+ elif language == "kannada":
963
+ lang_id=4
964
+ elif language == "hindi":
965
+ lang_id=5
966
+ elif language == "bengali":
967
+ lang_id=6
968
+ elif language in ("gujarathi", "gujrathi"): # accept the spelling used in GetFile as well
969
+ lang_id=7
970
+ elif language == "odiya":
971
+ lang_id=8
972
+ elif language == "punjabi":
973
+ lang_id=9
974
+ return lang_id
975
+
976
+
977
+ def convert_to_main_lang(g : GLOBALS,input_str,final_lang:str):
978
+ s= input_str
979
+ final_lang = "telugu" # NOTE: hard-coded; this overrides the final_lang argument
980
+ # print("input_str:",input_str)
981
+ final_lang_id=set_lang_id(final_lang)
982
+ c=1
983
+ # print(s,final_lang_id)
984
+ temp_string=''
985
+ new_string='&'
986
+ table=load_mapping_file(g)
987
+ # print(final_lang_id)
988
+ # print(table)
989
+ for i in range(1,len(s)):
990
+ if s[i]=="&":
991
+ c=1
992
+ continue
993
+ if c==1:
994
+ temp_string+=s[i]
995
+ if s[i+1]=="&":
996
+ c=0
997
+ # print("new_string_1:",new_string)
998
+ # print("old_string_1:",temp_string)
999
+ if temp_string=="#":
1000
+ new_string+=temp_string+"&"
1001
+ temp_string=''
1002
+ continue
1003
+ if temp_string =='av':
1004
+ new_string+=temp_string+"&"
1005
+ temp_string=''
1006
+ # print("new_string_1-av/aiv:",new_string)
1007
+ continue
1008
+ if temp_string =='eu' or temp_string =='euv'or temp_string =='aiv':
1009
+ new_string+=temp_string+"&"
1010
+ # print("new_string_1-eu:",new_string)
1011
+ # print("old_string_1-euv:",s)
1012
+ temp_string=''
1013
+ continue
1014
+
1015
+ # print("new_string_before_table:",new_string)
1016
+ # print("old_string_before_table:",s)
1017
+ for j in range(len(table)):
1018
+ if table[j][1]==temp_string:
1019
+ # print("2:",table[j][1],temp_string)
1020
+ # print("3:",table[j][final_lang_id+1],ord(table[j][final_lang_id+1][0]))
1021
+ if ord(table[j][final_lang_id+1][0]) < 122:
1022
+ new_string=new_string+table[j][final_lang_id+1]+"&"
1023
+ temp_string=''
1024
+ # print("new string_2:",new_string)
1025
+ break
1026
+ else:
1027
+ new_string+=temp_string+"&"
1028
+ # print("new string_3:",new_string)
1029
+ temp_string=''
1030
+ break
1031
+ return new_string
Unified_parser/ply/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ # PLY package
2
+ # Author: David Beazley ([email protected])
3
+ # https://github.com/dabeaz/ply
4
+
5
+ __version__ = '2022_01_02'
Unified_parser/ply/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (196 Bytes).
Unified_parser/ply/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (213 Bytes).
Unified_parser/ply/__pycache__/__init__.cpython-37.pyc ADDED
Binary file (168 Bytes).
Unified_parser/ply/__pycache__/__init__.cpython-38.pyc ADDED
Binary file (168 Bytes).
Unified_parser/ply/__pycache__/lex.cpython-310.pyc ADDED
Binary file (5.17 kB).
Unified_parser/ply/__pycache__/lex.cpython-311.pyc ADDED
Binary file (6.76 kB).
Unified_parser/ply/__pycache__/lex.cpython-37.pyc ADDED
Binary file (5.16 kB).
Unified_parser/ply/__pycache__/lex.cpython-38.pyc ADDED
Binary file (5.2 kB).
Unified_parser/ply/__pycache__/yacc.cpython-310.pyc ADDED
Binary file (40.8 kB).
Unified_parser/ply/__pycache__/yacc.cpython-311.pyc ADDED
Binary file (82.8 kB).
Unified_parser/ply/__pycache__/yacc.cpython-37.pyc ADDED
Binary file (41.3 kB).
Unified_parser/ply/__pycache__/yacc.cpython-38.pyc ADDED
Binary file (41.2 kB).
Unified_parser/ply/lex.py ADDED
@@ -0,0 +1,110 @@
+ from typing import NamedTuple
+ import re
+
+ def t_kaki_c(t):
+     r'(&)*(dxhq|txh|khq|dxq|dxh|zh|tx|th|sx|sh|rx|ph|nx|nj|ng|lx|kq|kh|jh|gq|gh|dx|dh|ch|bh|z|y|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b)((&)(dxhq|txh|khq|dxq|dxh|zh|tx|th|sx|sh|rx|ph|nx|nj|ng|lx|kq|kh|jh|gq|gh|ex|dx|dh|ch|bh|z|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b))*'
+     s = t
+     ans = ''
+     i = 1
+     if s[0] == '&':
+         ans += '&'
+     l = s.split('&')
+     for pch in l:
+         if pch == '':
+             continue
+         ans += f'{pch}&av&#&&'
+         i += 1
+     ans = ans[:(len(ans) - 7)]
+     return ans
+
+ def t_conjsyll2_c(t):
+     r'(eu)'
+     return 'eu&#'
+
+ def t_fullvowel_b(t):
+     r'(&)*(k|kh|g|gh|c|ch|j|jh|ng|nj|tx|txh|dx|dxh|nx|t|th|d|dh|n|p|ph|b|bh|m|y|r|l|w|sh|sx|s|lx|h|kq|khq|gq|z|dxq|dxhq|f|y)(&)(uu&mq|uu&hq|rq&mq|rq&hq|ou&mq|ou&hq|ii&mq|ii&hq|ei&mq|ei&hq|ee&mq|ee&hq|aa&mq|aa&hq|uu&q|u&mq|u&hq|rq&q|ou&q|o&mq|o&hq|ii&q|i&mq|i&hq|ei&q|ee&q|aa&q|a&mq|a&hq|u&q|o&q|i&q|a&q|uu|rq|ou|ii|ei|ee|ax|aa|u|o|i|a)'
+     return t
+
+ def t_kaki_a(t):
+     r'(&)*(dxhq|txh|khq|dxq|dxh|tx|th|sx|sh|ph|nx|nj|ng|lx|kq|kh|jh|gq|gh|dx|dh|ch|bh|z|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b)(&)(uuv|rqv|ouv|iiv|eiv|eev|aev|aav|uv|ov|mq|iv|hq|ax|q)(&)(mq|hq|q)*'
+     return t
+
+ def t_kaki_b(t):
+     r'(&)*(dxq&uuv|dxq&rqv|dxq&ouv|dxq&iiv|dxq&eiv|dxq&eev|dxq&aav|dxq&uv|dxq&ov|dxq&mq|dxq&iv|dxq&hq|dxq&q|dxq)'
+     return t
+
+ def t_conjsyll2_b(t):
+     r'(&)*(txh&eu|dxh&eu|tx&eu|th&eu|sx&eu|sh&eu|ph&eu|nx&eu|nj&eu|ng&eu|lx&eu|kh&eu|jh&eu|gh&eu|dx&eu|dh&eu|ch&eu|bh&eu|y&eu|w&eu|t&eu|s&eu|r&eu|p&eu|n&eu|m&eu|l&eu|k&eu|j&eu|h&eu|g&eu|d&eu|c&eu|b&eu)'
+     return t
+
+ def t_conjsyll2_a(t):
+     r'(&)*(dxhq|khq|dxq|kq|gq|z|y|f)(&)eu'
+     return t
+
+ def t_conjsyll1(t):
+     r'(&)*(dxhq|txh|khq|dxq|dxh|tx|th|sx|sh|ph|nx|nj|ng|lx|kq|kh|jh|gq|gh|dx|dh|ch|bh|z|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b)(&)(uu|rq|ou|ii|ei|ee|ax|aa|u|o|i)(&)(dxhq|uuv|txh|rqv|ouv|khq|iiv|eiv|eev|dxq|dxh|aev|aav|uv|uu|tx|th|sx|sh|rq|ph|ov|ou|nx|nj|ng|mq|kq|kh|jh|iv|ii|hq|gq|gh|ei|ee|dx|dh|ch|bh|ax|aa|z|y|w|u|t|s|r|q|p|o|n|m|l|k|j|i|h|g|f|d|c|b)(&)eu(&)(dxhq|txh|khq|dxq|dxh|tx|th|sx|sh|ph|nx|nj|ng|kq|kh|jh|gq|gh|dx|dh|ch|bh|z|y|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b)'
+     return t
+
+ def t_nukchan_b(t):
+     r'(&)*(txh|dxh|tx|th|sx|sh|ph|nx|nj|ng|lx|kh|jh|gh|dx|dh|ch|bh|y|w|t|s|r|p|n|m|l|k|j|h|g|d|c|b)(&)(mq|hq|q)'
+     return t
+
+ def t_nukchan_a(t):
+     r'(&)*(dxhq|khq|dxq|kq|gq|z|y|f)(&)(mq|hq|q)'
+     return t
+
+ def t_yarule(t):
+     r'(&)*(uuv|rqv|iiv|uv|iv)(&)(y)'
+     return t
+
+ def t_vowel(t):
+     r'(&)*(uu&mq|uu&hq|rq&mq|rq&hq|ou&mq|ou&hq|ii&mq|ii&hq|ei&mq|ei&hq|ee&mq|ee&hq|aa&mq|aa&hq|uu&q|u&mq|u&hq|rq&q|ou&q|o&mq|o&hq|ii&q|i&mq|i&hq|ei&q|ee&q|aa&q|a&mq|a&hq|u&q|o&q|i&q|a&q|uu|rq|ou|ii|ei|ee|ax|aa|u|o|i|a)'
+     return t
+
+ def t_fullvowel_a(t):
+     r'.'
+     return t
+
+ class Token(NamedTuple):
+     type: str
+     value: str
+
+ class Lexer:
+     def __init__(self):
+         # tokens identified by the lexer
+         self.tokens = ('kaki_c', 'conjsyll2_c', 'fullvowel_b', 'kaki_a', 'kaki_b', 'conjsyll2_b', 'conjsyll2_a',
+                        'conjsyll1', 'nukchan_b','nukchan_a', 'yarule', 'fullvowel_a', 'vowel')
+         self.token_specification = []
+         for tkn in self.tokens:
+             self.token_specification += [(tkn, r'{}'.format(eval('t_'+tkn).__doc__), eval('t_'+tkn))]
+
+         self.patterns = []
+         for pr in self.token_specification:
+             pn = re.compile(pr[1])
+             self.patterns += [pn]
+         self.tokencount = len(self.token_specification)
+         self.data = ''
+         self.idx = 0
+
+     def input(self,data):
+         self.data = data
+         self.idx = 0
+
+     def token(self):
+         maxlen = 0
+         maxidx = -1
+         maxmo = None
+         for i in range(self.tokencount):
+             mo = self.patterns[i].match(self.data, self.idx)
+             if mo != None:
+                 molen = mo.end() - mo.start()
+                 if molen > maxlen:
+                     maxlen = molen
+                     maxidx = i
+                     maxmo = mo
+
+         if maxlen == 0:
+             return None
+         self.idx += maxlen
+         tok = self.token_specification[maxidx][2](maxmo.group())
+         return Token(type = self.tokens[maxidx], value=tok)
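`Lexer.token` tries every compiled pattern at the current offset and keeps the longest match, with ties going to the rule listed first because the comparison is a strict `>`. The sketch below isolates that longest-match loop with a deliberately tiny rule set. The rule names and patterns in `RULES`, and the helpers `next_token` and `tokenize`, are hypothetical illustrations and not part of lex.py.

```python
import re
from typing import NamedTuple


class Tok(NamedTuple):
    type: str
    value: str


# Toy rules (assumed for illustration). Order matters: on a tie in match
# length, the earlier rule wins.
RULES = [
    ("consonant_vowel", re.compile(r"(kh|k|g)&(aa|a)")),
    ("consonant", re.compile(r"kh|k|g")),
    ("any", re.compile(r".")),
]


def next_token(data, idx):
    """Return (Tok, new_idx) for the longest match at idx, or None."""
    best_name, best_mo, best_len = None, None, 0
    for name, pattern in RULES:
        mo = pattern.match(data, idx)
        if mo is not None and mo.end() - mo.start() > best_len:
            best_name, best_mo, best_len = name, mo, mo.end() - mo.start()
    if best_len == 0:
        return None  # end of input, or nothing matched
    return Tok(best_name, best_mo.group()), idx + best_len


def tokenize(data):
    """Collect tokens until next_token reports no further match."""
    idx, out = 0, []
    while True:
        step = next_token(data, idx)
        if step is None:
            return out
        tok, idx = step
        out.append(tok)
```

For the input `"kh&aa&g"` the first token is `consonant_vowel` (five characters beat the two-character `consonant` match), the lone `&` falls to `any`, and the final `g` is tagged `consonant` because it precedes `any` in `RULES`.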
Unified_parser/ply/yacc.py ADDED
The diff for this file is too large to render. See raw diff
 
Unified_parser/punjabi/extract_punjabi.py ADDED
@@ -0,0 +1,15 @@
+ words = set()
+ with open('text', 'r') as f:
+     cnts = f.readlines()
+ for l in cnts:
+     l = l.strip('\n').split(' ')
+     for wd in l[1:]:
+         wd = wd.strip('.,|? ')
+         if wd != '':
+             words.add(wd)
+
+ words = list(words)
+ words = sorted(words)
+ with open('punjabi_words.txt', 'w') as f:
+     for w in words:
+         f.write(f'{w}\n')
Unified_parser/punjabi/punjabi_asr_sample ADDED
The diff for this file is too large to render. See raw diff
 
Unified_parser/punjabi/punjabi_results.txt ADDED
The diff for this file is too large to render. See raw diff
 
Unified_parser/punjabi/punjabi_words.txt ADDED
The diff for this file is too large to render. See raw diff
 
Unified_parser/punjabi/runner_punjabi.py ADDED
@@ -0,0 +1,13 @@
+ from parser import wordparse
+ from joblib import Parallel, delayed
+ from tqdm import tqdm
+
+ with open('punjabi_words.txt', 'r') as f:
+     words = f.readlines()
+
+ words = [wd.strip() for wd in words]
+ anslist = Parallel(n_jobs=1)(delayed(wordparse)(wd, 0, 0) for wd in tqdm(words))
+
+ with open('punjabi_results.txt', 'w') as f:
+     for i in range(len(words)):
+         f.write(f'{words[i]} = {anslist[i]}\n')
Unified_parser/pypi_package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2022 vikram-kv
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
Unified_parser/pypi_package/README.md ADDED
@@ -0,0 +1,34 @@
+ # Python_Unified_Parser
+
+ This parser attempts to unify Indian languages through the Common Label Set. Its design is shared across languages and capitalises on the syllable structure of Indian languages. The Unified Parser converts UTF-8 text to the common label set, applies letter-to-sound rules, and generates the corresponding phoneme sequences. The effort is a step towards a natural language understanding system that operates on Indian languages and generates parsed output. This structured method requires only basic knowledge of the language. With good lexicons, it is possible to parse more than 95% of the words in a language correctly. The method can be extended to other Indian languages with minimal time and effort. Given the unity in the diversity of Indian languages, developing parsers for new languages is easy using the unified approach.
+
+ Our Python parser, [uparser.py](src/indic-unified-parser/uparser.py), combines lex and yacc functionality in a single Python script using the [PLY](src/indic-unified-parser/ply) framework.
+
+ ## Publications
+ [Baby, Arun, et al. "A unified parser for developing Indian language text to speech synthesizers." Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings 19. Springer International Publishing, 2016.](https://www.iitm.ac.in/donlab/tts/downloads/unified/unified.pdf)
+
+ ## Installation
+
+ ```bash
+ pip install indic_unified_parser
+ ```
+
+ ## How to use
+
+ ```python
+ from indic_unified_parser.uparser import wordparse
+ parsed_output_string = wordparse(<word : str>, <lsflag : int>, <wfflag : int>, <clearflag : int>)
+ ```
+
+ 1. `lsflag`: always 0. Deprecated.
+ 2. `wfflag`: 0 for monophone parsing, 1 for syllable parsing, 2 for akshara parsing.
+ 3. `clearflag`: 1 to drop the Lisp-like output format and produce plain space-separated output; otherwise 0.
+
+ ## Examples
29
+ ## URLS
30
+ [Homepage](https://github.com/vikram-kv/Unified_Parser)
31
+
32
+ ## Authors
33
+
34
+ Vikram K V, Dual Degree, Computer Science Dept, IIT Madras.