Spaces:
Runtime error
Runtime error
divakaivan
committed on
Update app.py
Browse files
app.py
CHANGED
@@ -6,12 +6,12 @@ import torch
|
|
6 |
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
|
7 |
|
8 |
|
9 |
-
checkpoint = "
|
10 |
processor = SpeechT5Processor.from_pretrained(checkpoint)
|
11 |
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
|
12 |
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
|
13 |
|
14 |
-
|
15 |
speaker_embeddings = {
|
16 |
"BDL": "spkemb/cmu_us_bdl_arctic-wav-arctic_a0009.npy",
|
17 |
"CLB": "spkemb/cmu_us_clb_arctic-wav-arctic_a0144.npy",
|
@@ -62,14 +62,10 @@ title = "SpeechT5: Speech Synthesis"
|
|
62 |
description = """
|
63 |
The <b>SpeechT5</b> model is pre-trained on text as well as speech inputs, with targets that are also a mix of text and speech.
|
64 |
By pre-training on text and speech at the same time, it learns unified representations for both, resulting in improved modeling capabilities.
|
65 |
-
|
66 |
SpeechT5 can be fine-tuned for different speech tasks. This space demonstrates the <b>text-to-speech</b> (TTS) checkpoint for the English language.
|
67 |
-
|
68 |
See also the <a href="https://huggingface.co/spaces/Matthijs/speecht5-asr-demo">speech recognition (ASR) demo</a>
|
69 |
and the <a href="https://huggingface.co/spaces/Matthijs/speecht5-vc-demo">voice conversion demo</a>.
|
70 |
-
|
71 |
Refer to <a href="https://colab.research.google.com/drive/1i7I5pzBcU3WDFarDnzweIj4-sVVoIUFJ">this Colab notebook</a> to learn how to fine-tune the SpeechT5 TTS model on your own dataset or language.
|
72 |
-
|
73 |
<b>How to use:</b> Enter some English text and choose a speaker. The output is a mel spectrogram, which is converted to a mono 16 kHz waveform by the
|
74 |
HiFi-GAN vocoder. Because the model always applies random dropout, each attempt will give slightly different results.
|
75 |
The <em>Surprise Me!</em> option creates a completely randomized speaker.
|
@@ -77,11 +73,9 @@ The <em>Surprise Me!</em> option creates a completely randomized speaker.
|
|
77 |
|
78 |
article = """
|
79 |
<div style='margin:20px auto;'>
|
80 |
-
|
81 |
<p>References: <a href="https://arxiv.org/abs/2110.07205">SpeechT5 paper</a> |
|
82 |
<a href="https://github.com/microsoft/SpeechT5/">original GitHub</a> |
|
83 |
<a href="https://huggingface.co/mechanicalsea/speecht5-tts">original weights</a></p>
|
84 |
-
|
85 |
<pre>
|
86 |
@article{Ao2021SpeechT5,
|
87 |
title = {SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing},
|
@@ -92,9 +86,7 @@ article = """
|
|
92 |
year={2021}
|
93 |
}
|
94 |
</pre>
|
95 |
-
|
96 |
<p>Speaker embeddings were generated from <a href="http://www.festvox.org/cmu_arctic/">CMU ARCTIC</a> using <a href="https://huggingface.co/mechanicalsea/speecht5-vc/blob/main/manifest/utils/prep_cmu_arctic_spkemb.py">this script</a>.</p>
|
97 |
-
|
98 |
</div>
|
99 |
"""
|
100 |
|
@@ -111,6 +103,15 @@ gr.Interface(
|
|
111 |
fn=predict,
|
112 |
inputs=[
|
113 |
gr.Text(label="Input Text"),
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
114 |
],
|
115 |
outputs=[
|
116 |
gr.Audio(label="Generated Speech", type="numpy"),
|
|
|
6 |
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
|
7 |
|
8 |
|
9 |
+
checkpoint = "microsoft/speecht5_tts"
|
10 |
processor = SpeechT5Processor.from_pretrained(checkpoint)
|
11 |
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
|
12 |
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
|
13 |
|
14 |
+
|
15 |
speaker_embeddings = {
|
16 |
"BDL": "spkemb/cmu_us_bdl_arctic-wav-arctic_a0009.npy",
|
17 |
"CLB": "spkemb/cmu_us_clb_arctic-wav-arctic_a0144.npy",
|
|
|
62 |
description = """
|
63 |
The <b>SpeechT5</b> model is pre-trained on text as well as speech inputs, with targets that are also a mix of text and speech.
|
64 |
By pre-training on text and speech at the same time, it learns unified representations for both, resulting in improved modeling capabilities.
|
|
|
65 |
SpeechT5 can be fine-tuned for different speech tasks. This space demonstrates the <b>text-to-speech</b> (TTS) checkpoint for the English language.
|
|
|
66 |
See also the <a href="https://huggingface.co/spaces/Matthijs/speecht5-asr-demo">speech recognition (ASR) demo</a>
|
67 |
and the <a href="https://huggingface.co/spaces/Matthijs/speecht5-vc-demo">voice conversion demo</a>.
|
|
|
68 |
Refer to <a href="https://colab.research.google.com/drive/1i7I5pzBcU3WDFarDnzweIj4-sVVoIUFJ">this Colab notebook</a> to learn how to fine-tune the SpeechT5 TTS model on your own dataset or language.
|
|
|
69 |
<b>How to use:</b> Enter some English text and choose a speaker. The output is a mel spectrogram, which is converted to a mono 16 kHz waveform by the
|
70 |
HiFi-GAN vocoder. Because the model always applies random dropout, each attempt will give slightly different results.
|
71 |
The <em>Surprise Me!</em> option creates a completely randomized speaker.
|
|
|
73 |
|
74 |
article = """
|
75 |
<div style='margin:20px auto;'>
|
|
|
76 |
<p>References: <a href="https://arxiv.org/abs/2110.07205">SpeechT5 paper</a> |
|
77 |
<a href="https://github.com/microsoft/SpeechT5/">original GitHub</a> |
|
78 |
<a href="https://huggingface.co/mechanicalsea/speecht5-tts">original weights</a></p>
|
|
|
79 |
<pre>
|
80 |
@article{Ao2021SpeechT5,
|
81 |
title = {SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing},
|
|
|
86 |
year={2021}
|
87 |
}
|
88 |
</pre>
|
|
|
89 |
<p>Speaker embeddings were generated from <a href="http://www.festvox.org/cmu_arctic/">CMU ARCTIC</a> using <a href="https://huggingface.co/mechanicalsea/speecht5-vc/blob/main/manifest/utils/prep_cmu_arctic_spkemb.py">this script</a>.</p>
|
|
|
90 |
</div>
|
91 |
"""
|
92 |
|
|
|
103 |
fn=predict,
|
104 |
inputs=[
|
105 |
gr.Text(label="Input Text"),
|
106 |
+
gr.Radio(label="Speaker", choices=[
|
107 |
+
"BDL (male)",
|
108 |
+
"CLB (female)",
|
109 |
+
"KSP (male)",
|
110 |
+
"RMS (male)",
|
111 |
+
"SLT (female)",
|
112 |
+
"Surprise Me!"
|
113 |
+
],
|
114 |
+
value="BDL (male)"),
|
115 |
],
|
116 |
outputs=[
|
117 |
gr.Audio(label="Generated Speech", type="numpy"),
|