Transliteration Models for Indian languages

These are models for transliteration involving Indian languages. The models are essentially Statistical Machine Translation systems trained with Moses on a character-level parallel corpus of transliterations, so you will need Moses to use them. The transliteration corpus itself was mined in an unsupervised fashion from a translation corpus.
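
To illustrate what "character-level" means here: each word is treated like a sentence whose tokens are its individual characters. The sketch below only shows that general idea and is an assumption, not the exact preprocessing used to train these models; `to_chars` is a hypothetical helper and not part of Moses.

```shell
# Hypothetical helper (not part of Moses): print each word of the input
# on its own line, with its characters separated by single spaces.
to_chars() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -e 's/./& /g' -e 's/ $//'
}

to_chars "New Delhi"
# prints:
# N e w
# D e l h i
```

For non-ASCII scripts such as Devanagari, a UTF-8 locale is needed so that `sed` treats each code point as one character.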

Currently, we have trained transliteration models for five language pairs: bn-hi, ta-hi, te-hi, en-hi and mr-hi.

Moses has supported transliteration since version 2.1, so please ensure that you have Moses 2.1 or later installed.

Command to run the transliteration module using Moses:

$moseshome/mosesdecoder/scripts/Transliteration/post-decoding-transliteration.pl \
--moses-src-dir $moseshome/mosesdecoder --external-bin-dir $moseshome/tools \
--transliteration-model-dir {path to transliteration model folder} \
--oov-file {path to file containing OOV words; OOVs are space-separated, and each line contains all OOVs for the corresponding input line} \
--input-file {input file to be transliterated} --output-file {output file location} \
--input-extension {input language code, e.g. en} --output-extension {output language code, e.g. hi} \
--language-model {path to language model} \
--decoder $moseshome/mosesdecoder/bin/moses

A sample execution of the model will be as follows:

export moseshome={path to moses installation}
$moseshome/mosesdecoder/scripts/Transliteration/post-decoding-transliteration.pl \
--moses-src-dir $moseshome/mosesdecoder --external-bin-dir $moseshome/tools \
--transliteration-model-dir /home/ratish/project/nlp_resources/indic_nlp_resources/transliterate/en-hi \
--oov-file /home/ratish/project/translit/input.oov \
--input-file /home/ratish/project/translit/input.en \
--output-file /home/ratish/project/translit/output.hi \
--input-extension en --output-extension hi --language-model /home/ratish/project/translit/lm/nc.binlm.1 \
--decoder $moseshome/mosesdecoder/bin/moses
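
Since each line of the OOV file lists the OOVs for the corresponding input line, the two files should have the same number of lines. A minimal sanity check to run before invoking the script could look like the sketch below; `same_line_count` is a hypothetical helper and not part of Moses.

```shell
# Hypothetical helper (not part of Moses): succeed only if the input file
# and the OOV file have the same number of lines.
same_line_count() {
  input_lines=$(wc -l < "$1" | tr -d ' ')
  oov_lines=$(wc -l < "$2" | tr -d ' ')
  [ "$input_lines" -eq "$oov_lines" ]
}

# Usage (paths are placeholders):
# same_line_count input.en input.oov || echo "line counts differ" >&2
```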

So far, we have seen transliteration used as a post-processing step for machine translation output. If the models are needed purely for transliteration, the input file and the OOV file are the same.
Sample input file:
New Delhi is capital of India
India is worlds seventh largest nation in the World

OOV file:
New Delhi is capital of India
India is worlds seventh largest nation in the World

On running the transliteration module, the output is:
न्यू डेल्ही इस कैपिटल आफ इंडिया
इंडिया इस वर्ल्ड सेवंथ लारगेस्ट नेशन इन थे वर्ल्ड
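
For the pure transliteration case described above, the OOV file can simply be a copy of the input file. A minimal sketch follows; `make_oov_file` and the file names are hypothetical and used only for illustration.

```shell
# Hypothetical helper: for pure transliteration, the OOV file is
# identical to the input file, so a plain copy is enough.
make_oov_file() {
  cp -- "$1" "$2"
}

# Demonstration in a temporary directory, using the sample input above.
workdir=$(mktemp -d)
printf '%s\n' \
  "New Delhi is capital of India" \
  "India is worlds seventh largest nation in the World" > "$workdir/input.en"
make_oov_file "$workdir/input.en" "$workdir/input.oov"
cmp -s "$workdir/input.en" "$workdir/input.oov" && echo "OOV file matches input"
```

The resulting pair of files can then be passed as `--input-file` and `--oov-file` in the command shown earlier.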