balaramas committed on
Commit
4f94afb
1 Parent(s): 9681ee2

Upload 26 files

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+MUSTC_ROOT_french/en-fr/data/tst-COMMON/wav/ted_1096.wav filter=lfs diff=lfs merge=lfs -text
+MUSTC_ROOT_hindi/en-hi/data/tst-COMMON/wav/ted_1096.wav filter=lfs diff=lfs merge=lfs -text
MUSTC_ROOT_french/en-fr/config_st.yaml ADDED
@@ -0,0 +1,19 @@
+bpe_tokenizer:
+  bpe: sentencepiece
+  sentencepiece_model: /media/lab202/BALARAM_HDD/MUSTC_v1.0_en-fr/en-fr/spm_unigram8000_st.model
+input_channels: 1
+input_feat_per_channel: 80
+specaugment:
+  freq_mask_F: 27
+  freq_mask_N: 1
+  time_mask_N: 1
+  time_mask_T: 100
+  time_mask_p: 1.0
+  time_wrap_W: 0
+transforms:
+  '*':
+  - utterance_cmvn
+  _train:
+  - utterance_cmvn
+  - specaugment
+vocab_filename: spm_unigram8000_st.txt
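The fields in `config_st.yaml` above map one-to-one onto a plain dictionary; a minimal sketch of assembling the same structure in Python, with no YAML dependency (the helper name `build_st_config` is illustrative, not part of this repo):

```python
# Sketch: build the same speech-to-text config as a plain Python dict.
# Key names mirror config_st.yaml above; the function name is illustrative.

def build_st_config(spm_model_path: str, vocab_filename: str) -> dict:
    """Return a dict with the same keys as en-fr/en-hi config_st.yaml."""
    return {
        "bpe_tokenizer": {
            "bpe": "sentencepiece",
            "sentencepiece_model": spm_model_path,
        },
        "input_channels": 1,
        "input_feat_per_channel": 80,  # 80-dim log-mel filterbank features
        "specaugment": {
            "freq_mask_F": 27, "freq_mask_N": 1,
            "time_mask_N": 1, "time_mask_T": 100, "time_mask_p": 1.0,
            "time_wrap_W": 0,
        },
        "transforms": {
            "*": ["utterance_cmvn"],                      # eval: CMVN only
            "_train": ["utterance_cmvn", "specaugment"],  # train: CMVN + SpecAugment
        },
        "vocab_filename": vocab_filename,
    }

cfg = build_st_config("./spm_unigram8000_st.model", "spm_unigram8000_st.txt")
print(cfg["transforms"]["_train"])  # ['utterance_cmvn', 'specaugment']
```

Note that SpecAugment appears only under the `_train` split, so inference (as in app.py below) applies utterance-level CMVN alone.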
MUSTC_ROOT_french/en-fr/data/tst-COMMON/txt/tst-COMMON.en ADDED
@@ -0,0 +1,8 @@
+Back in New York, I am the head of development for a non-profit called Robin Hood.
+When I'm not fighting poverty, I'm fighting fires as the assistant captain of a volunteer fire company.
+Now in our town, where the volunteers supplement a highly skilled career staff, you have to get to the fire scene pretty early to get in on any action.
+I remember my first fire.
+I was the second volunteer on the scene, so there was a pretty good chance I was going to get in.
+But still it was a real footrace against the other volunteers to get to the captain in charge to find out what our assignments would be.
+When I found the captain, he was having a very engaging conversation with the homeowner, who was surely having one of the worst days of her life.
+Here it was, the middle of the night, she was standing outside in the pouring rain, under an umbrella, in her pajamas, barefoot, while her house was in flames.
MUSTC_ROOT_french/en-fr/data/tst-COMMON/txt/tst-COMMON.fr ADDED
@@ -0,0 +1,8 @@
+A New York, je suis responsable du développment pour un organisme à but non lucratif appelé Robin Hood.
+Quand je ne suis pas en train de combattre la pauvreté, je combat des incendies en tant qu'assistant capitaine d'une compagnie de pompiers volontaires.
+Et dans notre ville, où les volontaires viennent renforcer une équipe professionnelle hautement qualifiée, il faut arriver sur le lieu de l'incendie très tôt pour prendre part à l'action.
+Je me souviens de mon premier incendie.
+J'étais le deuxième volontaire sur les lieux, et donc j'avais de bonnes chances d'y aller.
+Mais pourtant c'était une vrai course à pied contre les autres volontaires pour arriver jusqu'au capitaine responsable pour découvrir ce que seraient nos missions.
+Quand j'ai trouvé le capitaine, il était en pleine conversation avec la propriétaire, qui était surement en train de vivre la pire journée de sa vie.
+C'était en pleine nuit, elle était là dehors sous la pluie battante, sous un parapluie, en pyjama, pieds nus, pendant que sa maison était en flammes.
MUSTC_ROOT_french/en-fr/data/tst-COMMON/txt/tst-COMMON.yaml ADDED
@@ -0,0 +1,8 @@
+- {duration: 5.0, offset: 0.0, rW: 17, uW: 0, speaker_id: spk.1096, wav: test.wav}
+- {duration: 5.160000, offset: 20.290000, rW: 17, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 8.110000, offset: 25.930000, rW: 29, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 1.560000, offset: 34.920000, rW: 5, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 4.180000, offset: 36.730000, rW: 21, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 5.580000, offset: 41.880000, rW: 26, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 8.610001, offset: 48.309999, rW: 27, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 9.680000, offset: 57.510000, rW: 29, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
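Each segment entry above gives an `offset` and `duration` in seconds within the source wav. For 16 kHz mono audio (the rate the demo converts to), the corresponding sample range is simple arithmetic; a small sketch (function name illustrative):

```python
# Sketch: convert a segment's (offset, duration) in seconds into sample
# indices for slicing a 16 kHz mono waveform. The helper name is illustrative.

SAMPLE_RATE = 16000

def segment_to_sample_range(offset: float, duration: float, sr: int = SAMPLE_RATE):
    """Return (start_sample, end_sample) covering one segment of the wav."""
    start = int(round(offset * sr))
    end = start + int(round(duration * sr))
    return start, end

# Second segment above: offset 20.29 s, duration 5.16 s
start, end = segment_to_sample_range(20.29, 5.16)
print(start, end)  # 324640 407200
```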
MUSTC_ROOT_french/en-fr/data/tst-COMMON/wav/ted_1096.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:69a122c3ad89320ec24cad84b622a01f26c3138b3e5869dc033e65bd0ab73fe1
+size 8990102
MUSTC_ROOT_french/en-fr/spm_unigram8000_st.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f0ee92c9ab8210fb21647509df3f65ae36527a3659005b05491da04008939098
+size 381797
MUSTC_ROOT_french/en-fr/spm_unigram8000_st.txt ADDED
The diff for this file is too large to render. See raw diff
 
MUSTC_ROOT_french/en-fr/tst-COMMON_st.tsv ADDED
The diff for this file is too large to render. See raw diff
 
MUSTC_ROOT_hindi/en-hi/config_st.yaml ADDED
@@ -0,0 +1,19 @@
+bpe_tokenizer:
+  bpe: sentencepiece
+  sentencepiece_model: ./spm_unigram8000_st.model
+input_channels: 1
+input_feat_per_channel: 80
+specaugment:
+  freq_mask_F: 27
+  freq_mask_N: 1
+  time_mask_N: 1
+  time_mask_T: 100
+  time_mask_p: 1.0
+  time_wrap_W: 0
+transforms:
+  '*':
+  - utterance_cmvn
+  _train:
+  - utterance_cmvn
+  - specaugment
+vocab_filename: spm_unigram8000_st.txt
MUSTC_ROOT_hindi/en-hi/data/tst-COMMON/txt/tst-COMMON.en ADDED
@@ -0,0 +1,8 @@
+Back in New York, I am the head of development for a non-profit called Robin Hood.
+When I'm not fighting poverty, I'm fighting fires as the assistant captain of a volunteer fire company.
+Now in our town, where the volunteers supplement a highly skilled career staff, you have to get to the fire scene pretty early to get in on any action.
+I remember my first fire.
+I was the second volunteer on the scene, so there was a pretty good chance I was going to get in.
+But still it was a real footrace against the other volunteers to get to the captain in charge to find out what our assignments would be.
+When I found the captain, he was having a very engaging conversation with the homeowner, who was surely having one of the worst days of her life.
+Here it was, the middle of the night, she was standing outside in the pouring rain, under an umbrella, in her pajamas, barefoot, while her house was in flames.
MUSTC_ROOT_hindi/en-hi/data/tst-COMMON/txt/tst-COMMON.hi ADDED
@@ -0,0 +1,8 @@
+न्यूयॉर्क में वापस, मैं रॉबिन हुड नामक एक गैर-लाभकारी संस्था के विकास का प्रमुख हूं।
+जब मैं गरीबी से नहीं लड़ रहा हूं, तो मैं स्वयंसेवी फायर कंपनी के सहायक कप्तान के रूप में आग से लड़ रहा हूं।
+अब हमारे शहर में, जहां स्वयंसेवक एक अत्यधिक कुशल कैरियर स्टाफ के पूरक हैं, आपको किसी भी कार्रवाई में शामिल होने के लिए आग के दृश्य पर बहुत जल्दी पहुंचना होगा।
+मुझे अपनी पहली आग याद है।
+मैं इस दृश्य पर दूसरा स्वयंसेवक था, इसलिए मेरे अंदर आने का एक अच्छा मौका था।
+लेकिन फिर भी यह अन्य स्वयंसेवकों के खिलाफ एक वास्तविक पदयात्रा थी जो प्रभारी कप्तान के पास यह पता लगाने के लिए थी कि हमारा कार्य क्या होगा।
+जब मैंने कप्तान को पाया, तो वह गृहस्वामी के साथ बहुत ही आकर्षक बातचीत कर रहा था, जो निश्चित रूप से उसके जीवन के सबसे बुरे दिनों में से एक था।
+यहाँ यह आधी रात थी, वह बारिश में बाहर, एक छतरी के नीचे, अपने पजामे में, नंगे पाँव खड़ी थी, जबकि उसका घर आग की लपटों में था।
MUSTC_ROOT_hindi/en-hi/data/tst-COMMON/txt/tst-COMMON.yaml ADDED
@@ -0,0 +1,8 @@
+- {duration: 5.0, offset: 0.0, rW: 17, uW: 0, speaker_id: spk.1096, wav: test.wav}
+- {duration: 5.160000, offset: 20.290000, rW: 17, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 8.110000, offset: 25.930000, rW: 29, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 1.560000, offset: 34.920000, rW: 5, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 4.180000, offset: 36.730000, rW: 21, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 5.580000, offset: 41.880000, rW: 26, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 8.610001, offset: 48.309999, rW: 27, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
+- {duration: 9.680000, offset: 57.510000, rW: 29, uW: 0, speaker_id: spk.1096, wav: ted_1096.wav}
MUSTC_ROOT_hindi/en-hi/data/tst-COMMON/wav/ted_1096.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:69a122c3ad89320ec24cad84b622a01f26c3138b3e5869dc033e65bd0ab73fe1
+size 8990102
MUSTC_ROOT_hindi/en-hi/fbank80.zip ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0bef03a45d7514d5018c4de30d352c736359248e6e8d70d586796aa32b30f4e2
+size 5242360
MUSTC_ROOT_hindi/en-hi/spm_unigram8000_st.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bf7b26c17db61dcd76400fbb74c5395d5f13837ed0fd5fa1098930de4f2a8202
+size 449800
MUSTC_ROOT_hindi/en-hi/spm_unigram8000_st.txt ADDED
The diff for this file is too large to render. See raw diff
 
MUSTC_ROOT_hindi/en-hi/tst-COMMON_st.tsv ADDED
@@ -0,0 +1,33 @@
+id	audio	n_frames	tgt_text	speaker
+test_0	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:3674535:136768	427	न्यूयॉर्क में वापस, मैं रॉबिन हुड नामक एक गैर-लाभकारी संस्था के विकास का प्रमुख हूं।	spk.1096
+ted_1096_0	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:44:44928	140	कप्तान ने मुझे लहराया।	spk.1096
+ted_1096_1	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:296095:272128	850	उन्होंने कहा, "बेज़ोस, मैं चाहता हूं कि आप घर में जाएं। मैं चाहता हूं कि आप ऊपर जाएं, आग को पार करें, और मैं चाहता हूं कि आप इस महिला को एक जोड़ी जूते दिलवाएं।"	spk.1096
+ted_1096_2	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:4231705:6208	19	(हँसी)	spk.1096
+ted_1096_3	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:5032741:14848	46	कसम है।	spk.1096
+ted_1096_4	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:1725316:447168	1397	तो, ठीक वैसा नहीं जैसा मैं उम्मीद कर रहा था, लेकिन मैं चला गया - सीढ़ियों से ऊपर, हॉल के नीचे, 'असली' अग्निशामकों के पीछे, जो इस बिंदु पर आग बुझाने के लिए बहुत कुछ कर चुके थे, मास्टर बेडरूम में	spk.1096
+ted_1096_5	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:3811347:152768	477	अब मुझे पता है कि तुम क्या सोच रहे हो, लेकिन मैं हीरो नहीं हूं।	spk.1096
+ted_1096_6	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:4237957:194048	606	मैं अपना पेलोड वापस नीचे की ओर ले गया जहाँ मैं अपने दास और कीमती कुत्ते से सामने के दरवाजे से मिला।	spk.1096
+ted_1096_7	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:725413:236928	740	हम अपने खजानों को बाहर गृहस्वामी के पास ले गए, जहां आश्चर्य की बात नहीं कि मेरे खजानों की तुलना में उनका अधिक ध्यान गया।	spk.1096
+ted_1096_8	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:5047633:192768	602	कुछ सप्ताह बाद, विभाग को गृहस्वामी की ओर से एक पत्र प्राप्त हुआ जिसमें उन्होंने उसके घर को बचाने के लिए किए गए साहसिक प्रयास के लिए हमें धन्यवाद दिया।	spk.1096
+ted_1096_9	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:1184318:188928	590	दयालुता का कार्य उसने अन्य सभी से ऊपर देखा: किसी ने उसे एक जोड़ी जूते भी दिलवाए थे।	spk.1096
+ted_1096_10	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:2172529:425408	1329	रॉबिन हुड में मेरे व्यवसाय और स्वयंसेवी फायर फाइटर के रूप में मेरे व्यवसाय दोनों में, मैं एक बड़े पैमाने पर उदारता और दयालुता के कृत्यों का साक्षी हूं, लेकिन मैं व्यक्तिगत आधार पर अनुग्रह और साहस के कार्यों का भी गवाह हूं।	spk.1096
+ted_1096_11	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:3267448:28608	89	और आप जानते हैं कि मैंने क्या सीखा है?	spk.1096
+ted_1096_12	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:701561:23808	74	वे सब मायने रखते हैं।	spk.1096
+ted_1096_13	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:2597982:338688	1058	इसलिए जब मैं इस कमरे के चारों ओर ऐसे लोगों को देखता हूं, जिन्होंने या तो सफलता के उल्लेखनीय स्तर हासिल किए हैं, या हासिल करने के रास्ते पर हैं, तो मैं यह याद दिलाता हूं: प्रतीक्षा न करें।	spk.1096
+ted_1096_14	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:4432050:164608	514	जब मैं गरीबी से नहीं लड़ रहा हूं, तो मैं स्वयंसेवी फायर कंपनी के सहायक कप्तान के रूप में आग से लड़ रहा हूं।	spk.1096
+ted_1096_15	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:198963:97088	303	किसी के जीवन में बदलाव लाने के लिए अपना पहला मिलियन बनाने तक प्रतीक्षा न करें।	spk.1096
+ted_1096_16	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:3964160:89408	279	अगर आपके पास देने के लिए कुछ है, तो अभी दे दो।	spk.1096
+ted_1096_17	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:1373291:92928	290	सूप किचन में खाना परोसें।	spk.1096
+ted_1096_18	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:4596703:24448	76	एक संरक्षक बनें।	spk.1096
+ted_1096_19	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:1466264:259008	809	अब हमारे शहर में, जहां स्वयंसेवक एक अत्यधिक कुशल कैरियर स्टाफ के पूरक हैं, आपको किसी भी कार्रवाई में शामिल होने के लिए आग के दृश्य पर बहुत जल्दी पहुंचना होगा।	spk.1096
+ted_1096_20	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:45017:49408	154	मुझे अपनी पहली आग याद है।	spk.1096
+ted_1096_21	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:568268:133248	416	मैं इस दृश्य पर दूसरा स्वयंसेवक था, इसलिए मेरे अंदर आने का एक अच्छा मौका था।	spk.1096
+ted_1096_22	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:4053613:178048	556	लेकिन फिर भी यह अन्य स्वयंसेवकों के खिलाफ एक वास्तविक पदयात्रा थी जो प्रभारी कप्तान के पास यह पता लगाने के लिए थी कि हमारा कार्य क्या होगा।	spk.1096
+ted_1096_23	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:4757689:275008	859	जब मैंने कप्तान को पाया, तो वह गृहस्वामी के साथ बहुत ही आकर्षक बातचीत कर रहा था, जो निश्चित रूप से उसके जीवन के सबसे बुरे दिनों में से एक था।	spk.1096
+ted_1096_24	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:3365247:309248	966	यहाँ यह आधी रात थी, वह बारिश में बाहर, एक छतरी के नीचे, अपने पजामे में, नंगे पाँव खड़ी थी, जबकि उसका घर आग की लपटों में था।	spk.1096
+ted_1096_25	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:4621196:136448	426	दूसरा स्वयंसेवक जो मुझसे ठीक पहले आया था -- चलो उसे लेक्स लूथर कहते हैं --	spk.1096
+ted_1096_26	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:3296101:3968	12	(हँसी)	spk.1096
+ted_1096_27	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:962386:221888	693	पहले कप्तान के पास गया और उसे अंदर जाकर गृहस्वामी के कुत्ते को बचाने के लिए कहा गया।	spk.1096
+ted_1096_28	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:94470:104448	326	कुत्ता!	spk.1096
+ted_1096_29	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:2936715:330688	1033	यहाँ कोई वकील या मनी मैनेजर था, जो अपने पूरे जीवन के लिए लोगों को बताता है कि वह एक जलती हुई इमारत में एक जीवित प्राणी को बचाने के लिए गया था, सिर्फ इसलिए कि उसने मुझे पाँच सेकंड से पीटा।	spk.1096
+ted_1096_30	/home/deepakprasad/nlp_code/fairseq_mustc_single_inference/MUSTC_ROOT/en-hi/fbank80.zip:3300114:65088	203	खैर, मैं अगला था।	spk.1096
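The `audio` column of this manifest packs three fields into one string, `<zip path>:<byte offset>:<byte length>`, pointing at a stored entry inside `fbank80.zip`. A small sketch of splitting it back apart (the helper name is illustrative; `rsplit` is used because the path itself may contain colons):

```python
# Sketch: parse the "audio" column of the tst-COMMON_st.tsv manifest above.
# Each value is "<zip path>:<byte offset>:<byte length>".

def parse_audio_entry(entry: str):
    """Split a manifest audio field into (zip_path, offset, length)."""
    path, offset, length = entry.rsplit(":", 2)  # path may contain ':'
    return path, int(offset), int(length)

path, offset, length = parse_audio_entry(
    "MUSTC_ROOT/en-hi/fbank80.zip:3674535:136768"
)
print(path, offset, length)  # MUSTC_ROOT/en-hi/fbank80.zip 3674535 136768
```

The offset/length pair is exactly what `get_zip_manifest` in data_utils.py (below) writes, so a reader can `seek` straight to the stored `.npy` bytes without unpacking the archive.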
app.py ADDED
@@ -0,0 +1,107 @@
+"""
+Script to translate a given single English audio file to the corresponding Hindi text
+Usage : python s2t_en2hi.py <audio_file_path> <averaged_checkpoints_file_path>
+"""
+
+
+
+import gradio as gr
+import sys
+import os
+import subprocess
+from pydub import AudioSegment
+from huggingface_hub import snapshot_download
+
+def install_fairseq():
+    try:
+        # Run pip install commands to install fairseq and its dependencies
+        subprocess.check_call(["pip", "install", "fairseq"])
+        subprocess.check_call(["pip", "install", "sentencepiece"])
+        subprocess.check_call(["pip", "install", "soundfile"])
+        return "fairseq successfully installed!"
+    except subprocess.CalledProcessError as e:
+        return f"An error occurred while installing fairseq: {str(e)}"
+
+def convert_audio_to_16k_wav(audio_input):
+    sound = AudioSegment.from_file(audio_input)
+    sample_rate = sound.frame_rate
+    num_channels = sound.channels
+    num_frames = int(sound.frame_count())
+    filename = audio_input.split("/")[-1]
+    print("original file is at:", audio_input)
+    if (num_channels > 1) or (sample_rate != 16000):  # convert to mono-channel 16k wav
+        if num_channels > 1:
+            sound = sound.set_channels(1)
+        if sample_rate != 16000:
+            sound = sound.set_frame_rate(16000)
+        num_frames = int(sound.frame_count())
+        filename = filename.replace(".wav", "") + "_16k.wav"
+        sound.export(filename, format="wav")
+    return filename
+
+
+def run_my_code(input_text, language):
+    # TODO better argument handling
+    audio = convert_audio_to_16k_wav(input_text)
+    hi_wav = audio
+
+    data_root = ""
+    model_checkpoint = ""
+    d_r = ""
+
+    if language == "Hindi":
+        model_checkpoint = "./models/hindi_model.pt"
+        data_root = "./MUSTC_ROOT_hindi/en-hi/"
+        d_r = "MUSTC_ROOT_hindi/"
+    if language == "French":
+        model_checkpoint = "./models/french_model.pt"
+        data_root = "./MUSTC_ROOT_french/en-fr/"
+        d_r = "MUSTC_ROOT_french/"
+
+
+
+    os.system(f"cp {hi_wav} {data_root}data/tst-COMMON/wav/test.wav")
+
+    print("------Starting data preparation...")
+    subprocess.run(["python", "prep_mustc_data_hindi_single.py", "--data-root", d_r, "--task", "st", "--vocab-type", "unigram", "--vocab-size", "8000"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+
+    print("------Performing translation...")
+
+    translation_result = subprocess.run(["fairseq-generate", data_root, "--config-yaml", "config_st.yaml", "--gen-subset", "tst-COMMON_st", "--task", "speech_to_text", "--path", model_checkpoint, "--max-tokens", "50000", "--beam", "5", "--scoring", "sacrebleu"], capture_output=True, text=True)
+    translation_result_text = translation_result.stdout
+
+    lines = translation_result_text.split("\n")
+
+    output_text = ""
+    print("\n\n------Translation results are:")
+    for i in lines:
+        if i.startswith("D-0"):
+            print(i.split("\t")[2])
+            output_text = i.split("\t")[2]
+            break
+
+    os.system(f"rm {data_root}data/tst-COMMON/wav/test.wav")
+    return output_text
+
+install_fairseq()
+
+# Define the input and output interfaces for Gradio
+#inputs = [
+#    gr.inputs.Audio(source="microphone", type="filepath", label="Record something (in English)..."),
+#    gr.inputs.Dropdown(list(LANGUAGE_CODES.keys()), default="Hindi", label="From English to Languages X..."),
+#    ]
+
+#input_textbox = gr.inputs.Textbox(label="test2.wav")
+#input = gr.inputs.Audio(source="microphone", type="filepath", label="Record something (in English)...")
+#audio = convert_audio_to_16k_wav(input)
+output_textbox = gr.outputs.Textbox(label="Output Text")
+
+# Create a Gradio interface
+iface = gr.Interface(
+    fn=run_my_code,
+    inputs=[gr.inputs.Audio(source="microphone", type="filepath", label="Record something (in English)..."), gr.inputs.Radio(["Hindi", "French"], label="Language")],
+    outputs=output_textbox,
+    title="English to Hindi/French Translator")
+
+# Launch the interface
+iface.launch()
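`run_my_code` pulls its answer out of fairseq-generate's stdout by grabbing the first detokenized hypothesis line (lines prefixed `D-<id>`). A standalone sketch of that parsing on synthetic output (the sample text below is made up, not real model output):

```python
# Sketch: extract the first detokenized hypothesis ("D-0..." line) from
# fairseq-generate output, mirroring the loop in run_my_code above.
# The sample output string is illustrative, not real model output.

def extract_first_hypothesis(generate_output: str) -> str:
    """Return the text field of the first D-0* line, or '' if none found."""
    for line in generate_output.split("\n"):
        if line.startswith("D-0"):
            # D-lines have the shape "D-<id>\t<score>\t<detokenized text>"
            return line.split("\t")[2]
    return ""

sample = "H-0\t-0.4\t\u2581some \u2581tokens\nD-0\t-0.4\tsome example text\n"
print(extract_first_hypothesis(sample))  # some example text
```

Taking only the first `D-0` line works here because the demo feeds a single `test.wav` utterance per request; a multi-utterance run would need to collect every `D-*` line instead.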
data_utils.py ADDED
@@ -0,0 +1,383 @@
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import csv
+from pathlib import Path
+import zipfile
+from functools import reduce
+from multiprocessing import cpu_count
+from typing import Any, Dict, List, Optional, Union
+import io
+
+import numpy as np
+import pandas as pd
+import sentencepiece as sp
+from fairseq.data.audio.audio_utils import (
+    convert_waveform, _get_kaldi_fbank, _get_torchaudio_fbank, is_npy_data,
+    is_sf_audio_data
+)
+import torch
+import soundfile as sf
+from tqdm import tqdm
+
+
+UNK_TOKEN, UNK_TOKEN_ID = "<unk>", 3
+BOS_TOKEN, BOS_TOKEN_ID = "<s>", 0
+EOS_TOKEN, EOS_TOKEN_ID = "</s>", 2
+PAD_TOKEN, PAD_TOKEN_ID = "<pad>", 1
+
+
+def gen_vocab(
+    input_path: Path, output_path_prefix: Path, model_type="bpe",
+    vocab_size=1000, special_symbols: Optional[List[str]] = None
+):
+    # Train SentencePiece Model
+    arguments = [
+        f"--input={input_path.as_posix()}",
+        f"--model_prefix={output_path_prefix.as_posix()}",
+        f"--model_type={model_type}",
+        f"--vocab_size={vocab_size}",
+        "--character_coverage=1.0",
+        f"--num_threads={cpu_count()}",
+        f"--unk_id={UNK_TOKEN_ID}",
+        f"--bos_id={BOS_TOKEN_ID}",
+        f"--eos_id={EOS_TOKEN_ID}",
+        f"--pad_id={PAD_TOKEN_ID}",
+    ]
+    if special_symbols is not None:
+        _special_symbols = ",".join(special_symbols)
+        arguments.append(f"--user_defined_symbols={_special_symbols}")
+    sp.SentencePieceTrainer.Train(" ".join(arguments))
+    # Export fairseq dictionary
+    spm = sp.SentencePieceProcessor()
+    spm.Load(output_path_prefix.as_posix() + ".model")
+    vocab = {i: spm.IdToPiece(i) for i in range(spm.GetPieceSize())}
+    assert (
+        vocab.get(UNK_TOKEN_ID) == UNK_TOKEN
+        and vocab.get(PAD_TOKEN_ID) == PAD_TOKEN
+        and vocab.get(BOS_TOKEN_ID) == BOS_TOKEN
+        and vocab.get(EOS_TOKEN_ID) == EOS_TOKEN
+    )
+    vocab = {
+        i: s
+        for i, s in vocab.items()
+        if s not in {UNK_TOKEN, BOS_TOKEN, EOS_TOKEN, PAD_TOKEN}
+    }
+    with open(output_path_prefix.as_posix() + ".txt", "w") as f_out:
+        for _, s in sorted(vocab.items(), key=lambda x: x[0]):
+            f_out.write(f"{s} 1\n")
+
+
+def extract_fbank_features(
+    waveform: torch.FloatTensor,
+    sample_rate: int,
+    output_path: Optional[Path] = None,
+    n_mel_bins: int = 80,
+    overwrite: bool = False,
+):
+    if output_path is not None and output_path.is_file() and not overwrite:
+        return
+
+    _waveform, _ = convert_waveform(waveform, sample_rate, to_mono=True)
+    # Kaldi compliance: 16-bit signed integers
+    _waveform = _waveform * (2 ** 15)
+    _waveform = _waveform.numpy()
+
+    features = _get_kaldi_fbank(_waveform, sample_rate, n_mel_bins)
+    if features is None:
+        features = _get_torchaudio_fbank(_waveform, sample_rate, n_mel_bins)
+    if features is None:
+        raise ImportError(
+            "Please install pyKaldi or torchaudio to enable fbank feature extraction"
+        )
+
+    if output_path is not None:
+        np.save(output_path.as_posix(), features)
+    return features
+
+
+def create_zip(data_root: Path, zip_path: Path):
+    paths = list(data_root.glob("*.npy"))
+    paths.extend(data_root.glob("*.flac"))
+    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as f:
+        for path in tqdm(paths):
+            f.write(path, arcname=path.name)
+
+
+def get_zip_manifest(
+    zip_path: Path, zip_root: Optional[Path] = None, is_audio=False
+):
+    _zip_path = Path.joinpath(zip_root or Path(""), zip_path)
+    with zipfile.ZipFile(_zip_path, mode="r") as f:
+        info = f.infolist()
+    paths, lengths = {}, {}
+    for i in tqdm(info):
+        utt_id = Path(i.filename).stem
+        offset, file_size = i.header_offset + 30 + len(i.filename), i.file_size
+        paths[utt_id] = f"{zip_path.as_posix()}:{offset}:{file_size}"
+        with open(_zip_path, "rb") as f:
+            f.seek(offset)
+            byte_data = f.read(file_size)
+            assert len(byte_data) > 1
+            if is_audio:
+                assert is_sf_audio_data(byte_data), i
+            else:
+                assert is_npy_data(byte_data), i
+            byte_data_fp = io.BytesIO(byte_data)
+            if is_audio:
+                lengths[utt_id] = sf.info(byte_data_fp).frames
+            else:
+                lengths[utt_id] = np.load(byte_data_fp).shape[0]
+    return paths, lengths
+
+
+def gen_config_yaml(
+    manifest_root: Path,
+    spm_filename: Optional[str] = None,
+    vocab_name: Optional[str] = None,
+    yaml_filename: str = "config.yaml",
+    specaugment_policy: Optional[str] = "lb",
+    prepend_tgt_lang_tag: bool = False,
+    sampling_alpha: Optional[float] = None,
+    input_channels: Optional[int] = 1,
+    input_feat_per_channel: Optional[int] = 80,
+    audio_root: str = "",
+    cmvn_type: str = "utterance",
+    gcmvn_path: Optional[Path] = None,
+    extra=None
+):
+    manifest_root = manifest_root.absolute()
+    writer = S2TDataConfigWriter(manifest_root / yaml_filename)
+    assert spm_filename is not None or vocab_name is not None
+    vocab_name = spm_filename.replace(".model", ".txt") if vocab_name is None \
+        else vocab_name
+    writer.set_vocab_filename(vocab_name)
+    if input_channels is not None:
+        writer.set_input_channels(input_channels)
+    if input_feat_per_channel is not None:
+        writer.set_input_feat_per_channel(input_feat_per_channel)
+    specaugment_setters = {
+        "lb": writer.set_specaugment_lb_policy,
+        "ld": writer.set_specaugment_ld_policy,
+        "sm": writer.set_specaugment_sm_policy,
+        "ss": writer.set_specaugment_ss_policy,
+    }
+    specaugment_setter = specaugment_setters.get(specaugment_policy, None)
+    if specaugment_setter is not None:
+        specaugment_setter()
+    if spm_filename is not None:
+        writer.set_bpe_tokenizer(
+            {
+                "bpe": "sentencepiece",
+                "sentencepiece_model": (manifest_root / spm_filename).as_posix(),
+            }
+        )
+    if prepend_tgt_lang_tag:
+        writer.set_prepend_tgt_lang_tag(True)
+    if sampling_alpha is not None:
+        writer.set_sampling_alpha(sampling_alpha)
+
+    if cmvn_type not in ["global", "utterance"]:
+        raise NotImplementedError
+
+    if specaugment_policy is not None:
+        writer.set_feature_transforms(
+            "_train", [f"{cmvn_type}_cmvn", "specaugment"]
+        )
+    writer.set_feature_transforms("*", [f"{cmvn_type}_cmvn"])
+
+    if cmvn_type == "global":
+        if gcmvn_path is None:
+            raise ValueError("Please provide path of global cmvn file.")
+        else:
+            writer.set_global_cmvn(gcmvn_path.as_posix())
+
+    if len(audio_root) > 0:
+        writer.set_audio_root(audio_root)
+
+    if extra is not None:
+        writer.set_extra(extra)
+    writer.flush()
+
+
+def load_df_from_tsv(path: Union[str, Path]) -> pd.DataFrame:
+    _path = path if isinstance(path, str) else path.as_posix()
+    return pd.read_csv(
+        _path,
+        sep="\t",
+        header=0,
+        encoding="utf-8",
+        escapechar="\\",
+        quoting=csv.QUOTE_NONE,
+        na_filter=False,
+    )
+
+
+def save_df_to_tsv(dataframe, path: Union[str, Path]):
+    _path = path if isinstance(path, str) else path.as_posix()
+    dataframe.to_csv(
+        _path,
+        sep="\t",
+        header=True,
+        index=False,
+        encoding="utf-8",
+        escapechar="\\",
+        quoting=csv.QUOTE_NONE,
+    )
+
+
+def load_tsv_to_dicts(path: Union[str, Path]) -> List[dict]:
+    with open(path, "r") as f:
+        reader = csv.DictReader(
+            f,
+            delimiter="\t",
+            quotechar=None,
+            doublequote=False,
+            lineterminator="\n",
+            quoting=csv.QUOTE_NONE,
+        )
+        rows = [dict(e) for e in reader]
+    return rows
+
+
+def filter_manifest_df(
+    df, is_train_split=False, extra_filters=None, min_n_frames=5, max_n_frames=3000
+):
+    filters = {
+        "no speech": df["audio"] == "",
+        f"short speech (<{min_n_frames} frames)": df["n_frames"] < min_n_frames,
+        "empty sentence": df["tgt_text"] == "",
+    }
+    if is_train_split:
+        filters[f"long speech (>{max_n_frames} frames)"] = df["n_frames"] > max_n_frames
+    if extra_filters is not None:
+        filters.update(extra_filters)
+    invalid = reduce(lambda x, y: x | y, filters.values())
+    valid = ~invalid
+    print(
+        "| "
+        + ", ".join(f"{n}: {f.sum()}" for n, f in filters.items())
+        + f", total {invalid.sum()} filtered, {valid.sum()} remained."
+    )
+    return df[valid]
+
+
+def cal_gcmvn_stats(features_list):
+    features = np.concatenate(features_list)
+    square_sums = (features ** 2).sum(axis=0)
+    mean = features.mean(axis=0)
+    features = np.subtract(features, mean)
+    var = square_sums / features.shape[0] - mean ** 2
+    std = np.sqrt(np.maximum(var, 1e-8))
+    return {"mean": mean.astype("float32"), "std": std.astype("float32")}
+
+
+class S2TDataConfigWriter(object):
+    DEFAULT_VOCAB_FILENAME = "dict.txt"
+    DEFAULT_INPUT_FEAT_PER_CHANNEL = 80
+    DEFAULT_INPUT_CHANNELS = 1
+
+    def __init__(self, yaml_path: Path):
+        try:
+            import yaml
+        except ImportError:
+            print("Please install PyYAML for S2T data config YAML files")
+        self.yaml = yaml
+        self.yaml_path = yaml_path
+        self.config = {}
+
+    def flush(self):
+        with open(self.yaml_path, "w") as f:
+            self.yaml.dump(self.config, f)
+
+    def set_audio_root(self, audio_root=""):
+        self.config["audio_root"] = audio_root
+
+    def set_vocab_filename(self, vocab_filename: str = "dict.txt"):
+        self.config["vocab_filename"] = vocab_filename
+
+    def set_specaugment(
+        self,
+        time_wrap_w: int,
+        freq_mask_n: int,
+        freq_mask_f: int,
+        time_mask_n: int,
+        time_mask_t: int,
+        time_mask_p: float,
+    ):
+        self.config["specaugment"] = {
+            "time_wrap_W": time_wrap_w,
+            "freq_mask_N": freq_mask_n,
+            "freq_mask_F": freq_mask_f,
+            "time_mask_N": time_mask_n,
+            "time_mask_T": time_mask_t,
+            "time_mask_p": time_mask_p,
+        }
+
+    def set_specaugment_lb_policy(self):
+        self.set_specaugment(
+            time_wrap_w=0,
+            freq_mask_n=1,
+            freq_mask_f=27,
+            time_mask_n=1,
+            time_mask_t=100,
+            time_mask_p=1.0,
+        )
+
+    def set_specaugment_ld_policy(self):
+        self.set_specaugment(
+            time_wrap_w=0,
+            freq_mask_n=2,
+            freq_mask_f=27,
+            time_mask_n=2,
+            time_mask_t=100,
+            time_mask_p=1.0,
+        )
+
+    def set_specaugment_sm_policy(self):
+        self.set_specaugment(
+            time_wrap_w=0,
+            freq_mask_n=2,
+            freq_mask_f=15,
+            time_mask_n=2,
+            time_mask_t=70,
+            time_mask_p=0.2,
+        )
+
+    def set_specaugment_ss_policy(self):
+        self.set_specaugment(
+            time_wrap_w=0,
+            freq_mask_n=2,
+            freq_mask_f=27,
+            time_mask_n=2,
+            time_mask_t=70,
+            time_mask_p=0.2,
+        )
+
+    def set_input_channels(self, input_channels: int = 1):
+        self.config["input_channels"] = input_channels
+
+    def set_input_feat_per_channel(self, input_feat_per_channel: int = 80):
+        self.config["input_feat_per_channel"] = input_feat_per_channel
+
+    def set_bpe_tokenizer(self, bpe_tokenizer: Dict[str, Any]):
+        self.config["bpe_tokenizer"] = bpe_tokenizer
+
+    def set_global_cmvn(self, stats_npz_path: str):
+        self.config["global_cmvn"] = {"stats_npz_path": stats_npz_path}
+
+    def set_feature_transforms(self, split: str, transforms: List[str]):
+        if "transforms" not in self.config:
+            self.config["transforms"] = {}
+        self.config["transforms"][split] = transforms
+
+    def set_prepend_tgt_lang_tag(self, flag: bool = True):
+        self.config["prepend_tgt_lang_tag"] = flag
+
+    def set_sampling_alpha(self, sampling_alpha: float = 1.0):
+        self.config["sampling_alpha"] = sampling_alpha
+
+    def set_extra(self, data):
+        self.config.update(data)
models/french_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10c940349cedf8dd3611e7d585cd36b544f9d7a379328147b96d057292dab359
+ size 373015859

models/hindi_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:47e8bfef22034ac859da3a2726b142876793113cf18ac18bb6f6eb85415a7893
+ size 373227272
prep_mustc_data_hindi_single.py ADDED
@@ -0,0 +1,263 @@
+ #!/usr/bin/env python3
+ # Copyright (c) Facebook, Inc. and its affiliates.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ import argparse
+ import logging
+ import os
+ from pathlib import Path
+ import shutil
+ from itertools import groupby
+ from tempfile import NamedTemporaryFile
+ from typing import Tuple
+
+ import numpy as np
+ import pandas as pd
+ import soundfile as sf
+ from examples.speech_to_text.data_utils import (
+     create_zip,
+     extract_fbank_features,
+     filter_manifest_df,
+     gen_config_yaml,
+     gen_vocab,
+     get_zip_manifest,
+     load_df_from_tsv,
+     save_df_to_tsv,
+     cal_gcmvn_stats,
+ )
+ import torch
+ from torch.utils.data import Dataset
+ from tqdm import tqdm
+
+ from fairseq.data.audio.audio_utils import get_waveform, convert_waveform
+
+
+ log = logging.getLogger(__name__)
+
+
+ MANIFEST_COLUMNS = ["id", "audio", "n_frames", "tgt_text", "speaker"]
+
+
+ class MUSTC(Dataset):
+     """
+     Create a Dataset for MuST-C. Each item is a tuple of the form:
+     waveform, sample_rate, source utterance, target utterance, speaker_id,
+     utterance_id
+     """
+
+     SPLITS = ["tst-COMMON"]
+     LANGUAGES = ["de", "es", "fr", "it", "nl", "pt", "ro", "ru", "hi"]
+
+     def __init__(self, root: str, lang: str, split: str) -> None:
+         assert split in self.SPLITS and lang in self.LANGUAGES
+         _root = Path(root) / f"en-{lang}" / "data" / split
+         wav_root, txt_root = _root / "wav", _root / "txt"
+         assert _root.is_dir() and wav_root.is_dir() and txt_root.is_dir()
+         # Load audio segments
+         try:
+             import yaml
+         except ImportError:
+             raise ImportError("Please install PyYAML to load the MuST-C YAML files")
+         with open(txt_root / f"{split}.yaml") as f:
+             segments = yaml.load(f, Loader=yaml.BaseLoader)
+         # Load source and target utterances
+         for _lang in ["en", lang]:
+             with open(txt_root / f"{split}.{_lang}") as f:
+                 utterances = [r.strip() for r in f]
+             assert len(segments) == len(utterances)
+             for i, u in enumerate(utterances):
+                 segments[i][_lang] = u
+         # Gather info
+         self.data = []
+         for wav_filename, _seg_group in groupby(segments, lambda x: x["wav"]):
+             wav_path = wav_root / wav_filename
+             sample_rate = sf.info(wav_path.as_posix()).samplerate
+             seg_group = sorted(_seg_group, key=lambda x: x["offset"])
+             for i, segment in enumerate(seg_group):
+                 offset = int(float(segment["offset"]) * sample_rate)
+                 n_frames = int(float(segment["duration"]) * sample_rate)
+                 _id = f"{wav_path.stem}_{i}"
+                 self.data.append(
+                     (
+                         wav_path.as_posix(),
+                         offset,
+                         n_frames,
+                         sample_rate,
+                         segment["en"],
+                         segment[lang],
+                         segment["speaker_id"],
+                         _id,
+                     )
+                 )
+
+     def __getitem__(
+         self, n: int
+     ) -> Tuple[torch.Tensor, int, str, str, str, str]:
+         wav_path, offset, n_frames, sr, src_utt, tgt_utt, spk_id, \
+             utt_id = self.data[n]
+         waveform, _ = get_waveform(wav_path, frames=n_frames, start=offset)
+         waveform = torch.from_numpy(waveform)
+         return waveform, sr, src_utt, tgt_utt, spk_id, utt_id
+
+     def __len__(self) -> int:
+         return len(self.data)
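The per-segment bookkeeping in `MUSTC.__init__` above converts the YAML `offset` and `duration` values (seconds, stored as strings by `yaml.BaseLoader`) into sample offsets and lengths. A minimal sketch of that arithmetic on a hypothetical segment entry (the sample rate is assumed; the real script reads it from the WAV header):

```python
# Hypothetical MuST-C segment entry: times are strings, in seconds.
segment = {"wav": "ted_1096.wav", "offset": "12.48", "duration": "3.20"}
sample_rate = 16_000  # assumed here; sf.info(...).samplerate in the script

# Same conversion as in MUSTC.__init__ above.
offset = int(float(segment["offset"]) * sample_rate)      # start, in samples
n_frames = int(float(segment["duration"]) * sample_rate)  # length, in samples

assert offset == 199_680   # 12.48 s at 16 kHz
assert n_frames == 51_200  # 3.20 s at 16 kHz
```

These two integers are exactly what `get_waveform(wav_path, frames=n_frames, start=offset)` consumes in `__getitem__`.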
+
+
+ def process(args):
+     root = Path(args.data_root).absolute()
+     for lang in MUSTC.LANGUAGES:
+         cur_root = root / f"en-{lang}"
+         if not cur_root.is_dir():
+             print(f"{cur_root.as_posix()} does not exist. Skipped.")
+             continue
+         # Extract features
+         audio_root = cur_root / ("flac" if args.use_audio_input else "fbank80")
+         audio_root.mkdir(exist_ok=True)
+
+         for split in MUSTC.SPLITS:
+             print(f"Fetching split {split}...")
+             dataset = MUSTC(root.as_posix(), lang, split)
+             if args.use_audio_input:
+                 print("Converting audios...")
+                 for waveform, sample_rate, _, _, _, utt_id in tqdm(dataset):
+                     tgt_sample_rate = 16_000
+                     _wavform, _ = convert_waveform(
+                         waveform, sample_rate, to_mono=True,
+                         to_sample_rate=tgt_sample_rate
+                     )
+                     sf.write(
+                         (audio_root / f"{utt_id}.flac").as_posix(),
+                         _wavform.T.numpy(), tgt_sample_rate
+                     )
+             else:
+                 print("Extracting log mel filter bank features...")
+                 gcmvn_feature_list = []
+                 if split == 'train' and args.cmvn_type == "global":
+                     print("And estimating cepstral mean and variance stats...")
+
+                 for waveform, sample_rate, _, _, _, utt_id in tqdm(dataset):
+                     features = extract_fbank_features(
+                         waveform, sample_rate, audio_root / f"{utt_id}.npy"
+                     )
+                     if split == 'train' and args.cmvn_type == "global":
+                         if len(gcmvn_feature_list) < args.gcmvn_max_num:
+                             gcmvn_feature_list.append(features)
+
+                 if split == 'train' and args.cmvn_type == "global":
+                     # Estimate and save global CMVN stats
+                     stats = cal_gcmvn_stats(gcmvn_feature_list)
+                     with open(cur_root / "gcmvn.npz", "wb") as f:
+                         np.savez(f, mean=stats["mean"], std=stats["std"])
+
+         # Pack features into ZIP
+         zip_path = cur_root / f"{audio_root.name}.zip"
+         print("ZIPing audios/features...")
+         create_zip(audio_root, zip_path)
+         print("Fetching ZIP manifest...")
+         audio_paths, audio_lengths = get_zip_manifest(
+             zip_path,
+             is_audio=args.use_audio_input,
+         )
+         # Generate TSV manifest
+         print("Generating manifest...")
+         train_text = []
+         for split in MUSTC.SPLITS:
+             is_train_split = split.startswith("train")
+             manifest = {c: [] for c in MANIFEST_COLUMNS}
+             dataset = MUSTC(args.data_root, lang, split)
+             for _, _, src_utt, tgt_utt, speaker_id, utt_id in tqdm(dataset):
+                 manifest["id"].append(utt_id)
+                 manifest["audio"].append(audio_paths[utt_id])
+                 manifest["n_frames"].append(audio_lengths[utt_id])
+                 manifest["tgt_text"].append(
+                     src_utt if args.task == "asr" else tgt_utt
+                 )
+                 manifest["speaker"].append(speaker_id)
+             if is_train_split:
+                 train_text.extend(manifest["tgt_text"])
+             df = pd.DataFrame.from_dict(manifest)
+             df = filter_manifest_df(df, is_train_split=is_train_split)
+             save_df_to_tsv(df, cur_root / f"{split}_{args.task}.tsv")
+         # Clean up
+         shutil.rmtree(audio_root)
+
+
+ def process_joint(args):
+     cur_root = Path(args.data_root)
+     assert all(
+         (cur_root / f"en-{lang}").is_dir() for lang in MUSTC.LANGUAGES
+     ), "do not have downloaded data available for all languages"
+     # Generate vocab
+     vocab_size_str = "" if args.vocab_type == "char" else str(args.vocab_size)
+     spm_filename_prefix = f"spm_{args.vocab_type}{vocab_size_str}_{args.task}"
+     with NamedTemporaryFile(mode="w") as f:
+         for lang in MUSTC.LANGUAGES:
+             tsv_path = cur_root / f"en-{lang}" / f"train_{args.task}.tsv"
+             df = load_df_from_tsv(tsv_path)
+             for t in df["tgt_text"]:
+                 f.write(t + "\n")
+         special_symbols = None
+         if args.task == 'st':
+             special_symbols = [f'<lang:{lang}>' for lang in MUSTC.LANGUAGES]
+         gen_vocab(
+             Path(f.name),
+             cur_root / spm_filename_prefix,
+             args.vocab_type,
+             args.vocab_size,
+             special_symbols=special_symbols
+         )
+     # Generate config YAML
+     gen_config_yaml(
+         cur_root,
+         spm_filename=spm_filename_prefix + ".model",
+         yaml_filename=f"config_{args.task}.yaml",
+         specaugment_policy="ld",
+         prepend_tgt_lang_tag=(args.task == "st"),
+     )
+     # Make symbolic links to manifests
+     for lang in MUSTC.LANGUAGES:
+         for split in MUSTC.SPLITS:
+             src_path = cur_root / f"en-{lang}" / f"{split}_{args.task}.tsv"
+             desc_path = cur_root / f"{split}_{lang}_{args.task}.tsv"
+             if not desc_path.is_symlink():
+                 os.symlink(src_path, desc_path)
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--data-root", "-d", required=True, type=str)
+     parser.add_argument(
+         "--vocab-type",
+         default="unigram",
+         required=True,
+         type=str,
+         choices=["bpe", "unigram", "char"],
+     )
+     parser.add_argument("--vocab-size", default=8000, type=int)
+     parser.add_argument("--task", type=str, choices=["asr", "st"])
+     parser.add_argument("--joint", action="store_true", help="")
+     parser.add_argument(
+         "--cmvn-type", default="utterance",
+         choices=["global", "utterance"],
+         help="The type of cepstral mean and variance normalization"
+     )
+     parser.add_argument(
+         "--gcmvn-max-num", default=150000, type=int,
+         help="Maximum number of sentences to use to estimate global mean and "
+              "variance"
+     )
+     parser.add_argument("--use-audio-input", action="store_true")
+     args = parser.parse_args()
+
+     if args.joint:
+         process_joint(args)
+     else:
+         process(args)
+
+
+ if __name__ == "__main__":
+     main()
s2t_en2hi.py ADDED
@@ -0,0 +1,32 @@
+ """
+ Script to translate a given single English audio file into the corresponding Hindi text.
+
+ Usage: python s2t_en2hi.py <audio_file_path> <averaged_checkpoints_file_path>
+ """
+
+ import sys
+ import os
+ import subprocess
+
+ # TODO better argument handling
+ hi_wav = sys.argv[1]
+ en2hi_model_checkpoint = sys.argv[2]
+
+ os.system(f"cp {hi_wav} ./MUSTC_ROOT/en-hi/data/tst-COMMON/wav/test.wav")
+
+ print("------Starting data preparation...")
+ subprocess.run(
+     ["python", "prep_mustc_data_hindi_single.py", "--data-root", "MUSTC_ROOT/",
+      "--task", "st", "--vocab-type", "unigram", "--vocab-size", "8000"],
+     stdout=subprocess.DEVNULL,
+ )
+
+ print("------Performing translation...")
+ translation_result = subprocess.run(
+     ["fairseq-generate", "./MUSTC_ROOT/en-hi/", "--config-yaml", "config_st.yaml",
+      "--gen-subset", "tst-COMMON_st", "--task", "speech_to_text",
+      "--path", en2hi_model_checkpoint, "--max-tokens", "50000",
+      "--beam", "5", "--scoring", "sacrebleu"],
+     capture_output=True, text=True,
+ )
+ translation_result_text = translation_result.stdout
+ print(translation_result_text)
+ lines = translation_result_text.split("\n")
+
+ print("\n\n------Translation results are:")
+ for i in lines:
+     if i.startswith("D-0"):
+         print(i)
+         break
+
+ os.system("rm ./MUSTC_ROOT/en-hi/data/tst-COMMON/wav/test.wav")
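The `# TODO better argument handling` above could be addressed with `argparse` instead of raw `sys.argv` indexing; a minimal sketch (the flag names and help strings are suggestions, not part of the existing script):

```python
import argparse


def parse_args(argv=None):
    # Hypothetical replacement for the raw sys.argv[1]/sys.argv[2] handling.
    parser = argparse.ArgumentParser(
        description="Translate a single English audio file to Hindi text."
    )
    parser.add_argument("audio_file", help="path to the input English .wav file")
    parser.add_argument("checkpoint", help="path to the averaged model checkpoint (.pt)")
    parser.add_argument("--beam", type=int, default=5,
                        help="beam size passed through to fairseq-generate")
    return parser.parse_args(argv)


# Parsing an explicit argv list; with argv=None it would read sys.argv.
args = parse_args(["input.wav", "models/hindi_model.pt", "--beam", "10"])
print(args.audio_file, args.checkpoint, args.beam)  # input.wav models/hindi_model.pt 10
```

This also gives free `--help` output and an error message, rather than an `IndexError`, when an argument is missing.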
s2t_en2hi_nolog.py ADDED
@@ -0,0 +1,32 @@
+ """
+ Script to translate a given single English audio file into the corresponding Hindi text,
+ suppressing the data-preparation logs and printing only the translation.
+
+ Usage: python s2t_en2hi_nolog.py <audio_file_path> <averaged_checkpoints_file_path>
+ """
+
+ import sys
+ import os
+ import subprocess
+
+ # TODO better argument handling
+ hi_wav = sys.argv[1]
+ en2hi_model_checkpoint = sys.argv[2]
+
+ os.system(f"cp {hi_wav} ./MUSTC_ROOT/en-hi/data/tst-COMMON/wav/test.wav")
+
+ print("------Starting data preparation...")
+ subprocess.run(
+     ["python", "prep_mustc_data_hindi_single.py", "--data-root", "MUSTC_ROOT/",
+      "--task", "st", "--vocab-type", "unigram", "--vocab-size", "8000"],
+     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
+ )
+
+ print("------Performing translation...")
+ translation_result = subprocess.run(
+     ["fairseq-generate", "./MUSTC_ROOT/en-hi/", "--config-yaml", "config_st.yaml",
+      "--gen-subset", "tst-COMMON_st", "--task", "speech_to_text",
+      "--path", en2hi_model_checkpoint, "--max-tokens", "50000",
+      "--beam", "5", "--scoring", "sacrebleu"],
+     capture_output=True, text=True,
+ )
+ lines = translation_result.stdout.split("\n")
+
+ print("\n\n------Translation results are:")
+ for i in lines:
+     if i.startswith("D-0"):
+         print(i.split("\t")[2])
+         break
+
+ os.system("rm ./MUSTC_ROOT/en-hi/data/tst-COMMON/wav/test.wav")
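The loop above relies on fairseq-generate's output convention: detokenized hypothesis lines start with `D-<sample id>` and are tab-separated as `D-id<TAB>score<TAB>text`, which is why `split("\t")[2]` yields the translation. A minimal sketch of that extraction on a fabricated output snippet (the Hindi sentence here is purely illustrative):

```python
# Fabricated fairseq-generate style output; real runs interleave S-, T-, H-, D- lines.
fake_output = "\n".join([
    "S-0\tsome source tokens",
    "H-0\t-0.42\tsome hypothesis tokens",
    "D-0\t-0.42\tयह एक उदाहरण वाक्य है",
])

hypothesis = None
for line in fake_output.split("\n"):
    if line.startswith("D-0"):
        # D-lines are "D-<id>\t<score>\t<detokenized text>".
        hypothesis = line.split("\t")[2]
        break

print(hypothesis)  # यह एक उदाहरण वाक्य है
```

Note that `startswith("D-0")` would also match `D-01`, `D-02`, etc.; for the single-file case here only sample 0 exists, so the ambiguity is harmless.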
test.wav ADDED
Binary file (141 kB)

test2.wav ADDED
Binary file (126 kB)