How to use

See an example of the inference pipeline for Russian TTS (G2P + FastPitch + HifiGAN) in this notebook, or use this bash script.
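As a rough illustration of how the FastPitch and HifiGAN stages chain together, here is a minimal sketch. It is not the notebook's code. It assumes both models are already loaded as NeMo TTS model objects and that the input text has already gone through the G2P step, which is not shown. The helper name `synthesize` and the checkpoint file names in the comments are hypothetical.

```python
def synthesize(spec_generator, vocoder, text):
    """Run text -> mel spectrogram -> waveform with two NeMo-style models.

    spec_generator: a loaded FastPitch model (assumed NeMo FastPitchModel).
    vocoder: a loaded HifiGAN model (assumed NeMo HifiGanModel).
    text: input text, assumed to be G2P-processed already.
    """
    # Convert the text into token IDs understood by the spectrogram generator.
    tokens = spec_generator.parse(text)
    # FastPitch stage: tokens -> batch of mel spectrograms.
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    # HifiGAN stage: batch of mel spectrograms -> batch of waveforms.
    return vocoder.convert_spectrogram_to_audio(spec=spectrogram)


# Usage sketch (requires the NeMo TTS collection; paths are placeholders):
# from nemo.collections.tts.models import FastPitchModel, HifiGanModel
# spec_generator = FastPitchModel.restore_from("fastpitch.nemo").eval()
# vocoder = HifiGanModel.restore_from("hifigan.nemo").eval()
# audio = synthesize(spec_generator, vocoder, "...")
```

Passing the loaded models in as arguments keeps the sketch independent of where the checkpoints actually live.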

Input

This model accepts batches of mel spectrograms.
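To make "batches of mel spectrograms" concrete, the sketch below pads variable-length spectrograms to a common number of frames so they can be stacked into one batch. The `[batch, n_mels, frames]` layout and the zero padding value are assumptions for illustration, not something this card specifies, and `pad_mel_batch` is a hypothetical helper. Plain nested lists stand in for tensors.

```python
def pad_mel_batch(mels, pad_value=0.0):
    """Pad mel spectrograms (each n_mels x T_i, as nested lists) to a common T.

    Returns a nested list shaped [batch, n_mels, max_frames].
    pad_value is an assumption; use whatever the training setup expects.
    """
    n_mels = len(mels[0])
    # Longest spectrogram in the batch determines the padded length.
    max_frames = max(len(mel[0]) for mel in mels)
    batch = []
    for mel in mels:
        if len(mel) != n_mels:
            raise ValueError("all spectrograms must have the same number of mel bins")
        # Right-pad every mel-bin row up to max_frames.
        batch.append([row + [pad_value] * (max_frames - len(row)) for row in mel])
    return batch


# Two toy spectrograms with 2 mel bins and 2 and 1 frames respectively.
batch = pad_mel_batch([[[1.0, 2.0], [3.0, 4.0]], [[5.0], [6.0]]])
```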

Output

This model outputs audio at 22050 Hz.
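When saving the generated audio, the file header should carry the same 22050 Hz rate, otherwise playback speed and pitch will be off. A minimal sketch using only the Python standard library follows. It assumes the waveform has been converted to a flat sequence of floats in [-1.0, 1.0], and `write_wav` is a hypothetical helper name.

```python
import struct
import wave

SAMPLE_RATE = 22050  # output rate of this model


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Returns the clip duration in seconds.
    """
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range before scaling to 16-bit integers.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16-bit PCM
        wav_file.setframerate(sample_rate)  # must match the model's output rate
        wav_file.writeframes(bytes(frames))
    return len(samples) / sample_rate
```

For example, writing 22050 samples with the default rate produces a one-second clip.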

Training

The model was trained with the NeMo toolkit [1] for several epochs. The full training script is here.

Datasets

This model was trained on the RUSLAN [2] corpus (single speaker, male voice), sampled at 22050 Hz.

References

  • [1] NVIDIA NeMo Toolkit
  • [2] Gabdrakhmanov L., Garaev R., Razinkov E. (2019) RUSLAN: Russian Spoken Language Corpus for Speech Synthesis. In: Salah A., Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham