---
license: cc-by-4.0
language:
- vi
base_model:
- SWivid/F5-TTS
tags:
- tts
- vietnamese
- voice-cloning
---

# 🇻🇳 Vietnamese Text-to-Speech (TTS)

## **Model Description**

This is a **Vietnamese Text-to-Speech (TTS) model** trained to generate natural-sounding Vietnamese speech from text. The model is designed for applications such as virtual assistants, audiobooks, and accessibility tools.

- **Model Name:** `zalopay/vietnamese-tts`
- **Language:** Vietnamese (`vi`)
- **Task:** Text-to-Speech (TTS)
- **Framework:** *F5-TTS*
- **License:** *CC-BY-4.0*

## **Model Architecture**

- F5-TTS uses a Diffusion Transformer (DiT) with ConvNeXt V2, which makes training and inference faster.

## **Training Data**

- **Dataset:** This model was trained on more than 200 hours of public Vietnamese voice data and YouTube audio.

## **Inference Example**

```python
import gradio as gr
from cached_path import cached_path

from f5_tts.model import DiT
from f5_tts.infer.utils_infer import (
    preprocess_ref_audio_text,
    load_vocoder,
    load_model,
    infer_process,
    save_spectrogram,
)

# Load the vocoder that turns mel spectrograms into waveforms.
vocoder = load_vocoder()

# DiT configuration used for this checkpoint:
# dim: 1024
# depth: 22
# heads: 16
# ff_mult: 2
# text_dim: 512
model = load_model(
    DiT,
    dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4),
    ckpt_path=str(
        cached_path("hf://zalopay/vietnamese-tts/model_960000.pt")
    ),
    mel_spec_type="vocos",
    vocab_file=str(cached_path("hf://zalopay/vietnamese-tts/vocab.txt")),
)

...

# `ref_audio_orig`, `ref_text`, `gen_text`, and `speed` come from the elided code above.
ref_audio, ref_text = preprocess_ref_audio_text(ref_audio_orig, ref_text)
gr.Info("Generated audio text: {} with audio file {} ".format(ref_text, ref_audio_orig))

# Synthesize `gen_text` in the voice of the reference audio.
final_wave, final_sample_rate, combined_spectrogram = infer_process(
    ref_audio,
    ref_text,
    gen_text,
    model,
    vocoder,
    cross_fade_duration=0.15,
    nfe_step=32,
    speed=speed,
)
```

## **Applications**

- Virtual assistants (e.g., chatbots, AI voice interactions)
- Audiobooks and content narration
- Accessibility tools for visually impaired users
- Automated announcements and voiceovers

## **Limitations & Biases**

- May struggle with uncommon words or names.
- Limited support for different accents or dialects.
- Background noise or pronunciation inconsistencies may occur.
- Duplicated (repeated) speech may occur.

## **Citation**

If you use this model, please cite:

```bibtex
@misc{zalopay-vietnamese-tts,
  title={Zalopay Vietnamese Text-to-Speech Model},
  author={Zalopay},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/zalopay/vietnamese-tts}
}
```

## **Acknowledgments**

Special thanks to F5-TTS for providing such a wonderful base model and framework.
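
## **Saving the Generated Audio**

The inference example returns a waveform and a sample rate but does not show how to write them to disk. The following is a minimal sketch using only the Python standard library. It assumes `final_wave` is a one-dimensional sequence of float samples in the range [-1.0, 1.0]. `save_wave` is a hypothetical helper, not part of F5-TTS, and the sine tone and the 24 kHz rate are placeholders standing in for real model output.

```python
import math
import struct
import wave


def save_wave(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; not part of F5-TTS.
    """
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range so out-of-range values cannot overflow int16.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(bytes(frames))


# Placeholder for `final_wave`: one second of a 440 Hz tone.
sample_rate = 24000  # assumed value; use `final_sample_rate` with real output
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
    for n in range(sample_rate)
]
save_wave("output.wav", tone, sample_rate)
```

With real output, the call would be `save_wave("output.wav", final_wave, final_sample_rate)`. If the `soundfile` package is already installed, `soundfile.write` is a shorter alternative.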