---
license: cc-by-4.0
language:
- vi
base_model:
- SWivid/F5-TTS
tags:
- tts
- vietnamese
- voice-cloning
---
# 🇻🇳 Vietnamese Text-to-Speech (TTS)

## **Model Description**
This is a **Vietnamese Text-to-Speech (TTS) model** trained to generate natural-sounding Vietnamese speech from text. The model is designed for applications such as virtual assistants, audiobooks, and accessibility tools.

- **Model Name:** `zalopay/vietnamese-tts`
- **Language:** Vietnamese (`vi`)
- **Task:** Text-to-Speech (TTS)
- **Framework:** *F5-TTS*
- **License:** *CC-BY-4.0*

## **Model Architecture**
- F5-TTS uses a Diffusion Transformer (DiT) with ConvNeXt V2, which makes both training and inference faster.

## **Training Data**
- **Dataset:** This model was trained on 200+ hours of public Vietnamese voice data and YouTube audio.

## **Inference Example**
```python
from cached_path import cached_path

from f5_tts.infer.utils_infer import (
    preprocess_ref_audio_text,
    load_vocoder,
    load_model,
    infer_process,
    save_spectrogram,
)
from f5_tts.model import DiT


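# Load the vocoder that turns mel spectrograms into waveforms.
vocoder = load_vocoder()
# Load the DiT checkpoint and vocabulary from the Hugging Face Hub.
# Model config: dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512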
model = load_model(
    DiT,
    dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4),
    ckpt_path=str(
        cached_path("hf://zalopay/vietnamese-tts/model_960000.pt")
    ),
    mel_spec_type="vocos",
    vocab_file=str(cached_path("hf://zalopay/vietnamese-tts/vocab.txt")),
)

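...  # define ref_audio_orig, ref_text, gen_text, and speed before this point

# Preprocess the reference audio and its transcript.
ref_audio, ref_text = preprocess_ref_audio_text(ref_audio_orig, ref_text)
print("Reference text: {} with audio file {}".format(ref_text, ref_audio_orig))
# Generate speech for gen_text in the voice of the reference audio.
final_wave, final_sample_rate, combined_spectrogram = infer_process(
    ref_audio,
    ref_text,
    gen_text,
    model,
    vocoder,
    cross_fade_duration=0.15,
    nfe_step=32,
    speed=speed,
)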

```
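
The example above stops once `infer_process` returns, so the generated audio is never written to disk. The sketch below shows one way to save it using only the Python standard library. It assumes that `final_wave` is a one-dimensional sequence of float samples in the range [-1.0, 1.0] and that `final_sample_rate` is an integer number of samples per second. The helper `save_wave_pcm16`, the demo samples, and the output file name are hypothetical and are not part of F5-TTS.

```python
import os
import struct
import tempfile
import wave


def save_wave_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(bytes(frames))


# Stand-in data for demonstration; in practice, pass final_wave and
# final_sample_rate from infer_process instead.
demo_wave = [0.0, 0.5, -0.5, 1.0, -1.0]
demo_path = os.path.join(tempfile.gettempdir(), "vietnamese_tts_demo.wav")
save_wave_pcm16(demo_path, demo_wave, 24000)
```

With real output, the call would look like `save_wave_pcm16("output.wav", final_wave, final_sample_rate)`. For long outputs, an audio library that writes arrays directly is likely to be faster than this per-sample loop.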

## **Applications**
- Virtual assistants (e.g., chatbots, AI voice interactions)
- Audiobooks and content narration
- Accessibility tools for visually impaired users
- Automated announcements and voiceovers

## **Limitations & Biases**
- May struggle with uncommon words or names.
- Limited support for different accents or dialects.
- Background noise or pronunciation inconsistencies may occur.
- Duplicated (repeated) speech may occur.
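
One possible contributor to trouble with uncommon words or names is input text containing characters that are missing from the model vocabulary. The sketch below is a rough pre-check for that case. It assumes that `vocab.txt` lists one token per line and that text is tokenized character by character; this card confirms neither assumption. The helpers `load_vocab` and `find_unknown_chars` are hypothetical and are not part of F5-TTS.

```python
def load_vocab(vocab_path):
    """Read a vocabulary file, assuming one token per line."""
    with open(vocab_path, encoding="utf-8") as vocab_file:
        # Strip only the newline so that a space token (" ") is preserved.
        return {line.rstrip("\n") for line in vocab_file}


def find_unknown_chars(text, vocab):
    """Return characters of `text` that are absent from `vocab`, in first-seen order."""
    unknown = []
    for char in text:
        if char not in vocab and char not in unknown:
            unknown.append(char)
    return unknown


# Tiny in-memory vocabulary for demonstration only; in practice, build it
# with load_vocab() from a local copy of vocab.txt.
demo_vocab = set("xin chào")
print(find_unknown_chars("xin chào Ω", demo_vocab))
```

Any characters reported this way could be replaced or spelled out before the text is passed as `gen_text`.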

## **Citation**
If you use this model, please cite:
```bibtex
@misc{zalopay-vietnamese-tts,
  title={Zalopay Vietnamese Text-to-Speech Model},
  author={Zalopay},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/zalopay/vietnamese-tts}
}
```

## **Acknowledgments**
Special thanks to F5-TTS for providing such a wonderful base model and framework.