How to use

See an example of the inference pipeline for Russian TTS (G2P + FastPitch + HifiGAN) in this notebook, or use this bash script.

Input

This model accepts batches of mel spectrograms.

Output

This model outputs audio at a sampling rate of 22050 Hz.
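Because the output rate is fixed, any code that saves or plays the generated audio has to use the same rate, or the speech will sound sped up or slowed down. The following is a minimal sketch using only the Python standard library. It assumes the vocoder output has already been converted to a flat sequence of mono float samples in [-1.0, 1.0]; the helper names `write_wav` and `duration_seconds` and the file name `example.wav` are hypothetical, and a synthetic tone stands in for real model output.

```python
import math
import struct
import wave

SAMPLE_RATE = 22050  # output rate stated in this model card


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to avoid integer overflow
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)  # must match the model's output rate
        wav.writeframes(bytes(frames))


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Length of a clip in seconds, given its sample count."""
    return num_samples / sample_rate


# Stand-in for vocoder output (assumption): one second of a 440 Hz tone.
audio = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
         for n in range(SAMPLE_RATE)]
write_wav("example.wav", audio)
```

In a real pipeline, `audio` would be replaced by the waveform returned by the vocoder, moved to the CPU and flattened to a single utterance.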

Training

The model was trained for several epochs with the NeMo toolkit [1]. The full training script is here.

Datasets

This model was trained on the RUSLAN [2] corpus (single speaker, male voice), sampled at 22050 Hz.

References

  • [1] NVIDIA NeMo Toolkit
  • [2] Gabdrakhmanov L., Garaev R., Razinkov E. (2019) RUSLAN: Russian Spoken Language Corpus for Speech Synthesis. In: Salah A., Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham