+---
+license: apache-2.0
+tags:
+- text-to-speech
+- tts
+- voice-cloning
+- speech-synthesis
+- pytorch
+- audio
+- chinese
+- english
+- zero-shot
+- diffusion
+library_name: transformers
+pipeline_tag: text-to-speech
+---
+# MegaTTS3-WaveVAE: Complete Voice Cloning Model
+<div align="center">
+  <h3>🚀 <a href="https://github.com/Saganaki22/MegaTTS3-WaveVAE">GitHub Repository</a></h3>
+  <img src="https://img.shields.io/github/stars/Saganaki22/MegaTTS3-WaveVAE?style=social" alt="GitHub Stars">
+  <img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License">
+  <img src="https://img.shields.io/badge/Platform-Windows-blue" alt="Platform">
+  <img src="https://img.shields.io/badge/Language-Chinese%20%7C%20English-red" alt="Language">
+</div>
+
+## About
+This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, it includes the full WaveVAE encoder/decoder, which enables voice cloning directly from audio samples.
+**Key Features:**
+- 🎯 Zero-shot voice cloning from any 3-24 second audio sample
+- 🌍 Bilingual: Chinese, English, and code-switching
+- ⚡ Efficient: 0.45B parameter diffusion transformer
+- 🔧 Complete: Includes the full WaveVAE encoder/decoder (not fully available in the original release)
+- 🎛️ Controllable: Adjustable voice similarity and clarity
+- 💻 Windows ready: One-click installer available
+## Quick Start
+### Installation
+**[📥 One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)**: automated setup with GPU detection.
+Advanced users can follow the [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) instructions instead.
+### Usage Examples
+```bash
+# Basic voice cloning: clone the voice in reference.wav and write the result to ./output
+python tts/infer_cli.py --input_wav "reference.wav" --input_text "Your text here" --output_dir ./output
+# Better quality: raise the intelligibility (--p_w) and similarity (--t_w) weights
+python tts/infer_cli.py --input_wav "reference.wav" --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0
+# Web interface (easiest)
+python tts/megatts3_gradio.py
+# Then open http://localhost:7929 in a browser
+```
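For scripted or batch use, the CLI calls above can be wrapped from Python. The following is a minimal sketch, not part of the repository: `build_infer_command` and `clone_lines` are hypothetical helper names, only the flags shown above are used, and it assumes the script is run from the repository root and that the CLI falls back to its own defaults when `--p_w` and `--t_w` are omitted (as in the basic example).

```python
import subprocess
import sys
from typing import List, Optional


def build_infer_command(
    input_wav: str,
    input_text: str,
    output_dir: str,
    p_w: Optional[float] = None,
    t_w: Optional[float] = None,
) -> List[str]:
    """Build the argument list for tts/infer_cli.py (hypothetical helper)."""
    cmd = [
        sys.executable, "tts/infer_cli.py",
        "--input_wav", input_wav,
        "--input_text", input_text,
        "--output_dir", output_dir,
    ]
    # Pass the weights only when given, so the CLI's own defaults apply otherwise.
    if p_w is not None:
        cmd += ["--p_w", str(p_w)]
    if t_w is not None:
        cmd += ["--t_w", str(t_w)]
    return cmd


def clone_lines(input_wav: str, lines: List[str], output_root: str, **weights) -> None:
    """Run one synthesis per text line, each into its own numbered directory."""
    for i, text in enumerate(lines):
        # Separate directories are an assumption, to avoid overwriting outputs.
        cmd = build_infer_command(input_wav, text, f"{output_root}/{i:03d}", **weights)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Passing arguments as a list avoids shell quoting problems when the input text contains quotes or spaces. A call such as `clone_lines("reference.wav", ["First line.", "Second line."], "./output", p_w=2.0, t_w=3.0)` would run the "better quality" command once per line.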
+## Model Components
+- **Diffusion Transformer**: 0.45B parameter TTS model
+- **WaveVAE**: High-quality audio encoder/decoder
+- **Aligner**: Speech-text alignment model
+- **G2P**: Grapheme-to-phoneme converter
+## Parameters
+- `--p_w` (intelligibility weight): 1.0-5.0; higher values give clearer speech
+- `--t_w` (similarity weight): 0.0-10.0; higher values give output closer to the reference voice
+- **Tip**: Set `--t_w` 0 to 3 points higher than `--p_w` (for example, `--p_w 2.0 --t_w 3.0`)
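The ranges and the tip above can be checked before launching a long synthesis run. This is a small sketch under the assumption that the documented ranges are guidance rather than limits enforced by the CLI; `check_weights` is a hypothetical helper, not a function from the repository.

```python
from typing import List


def check_weights(p_w: float, t_w: float) -> List[str]:
    """Return warnings for a (p_w, t_w) pair; an empty list means it fits the guidance."""
    warnings = []
    if not 1.0 <= p_w <= 5.0:
        warnings.append(f"p_w={p_w} is outside the documented 1.0-5.0 range")
    if not 0.0 <= t_w <= 10.0:
        warnings.append(f"t_w={t_w} is outside the documented 0.0-10.0 range")
    # The tip suggests keeping t_w between 0 and 3 points above p_w.
    gap = t_w - p_w
    if not 0.0 <= gap <= 3.0:
        warnings.append(f"t_w - p_w = {gap:g}; the tip suggests a gap of 0 to 3")
    return warnings
```

For the pair used in the usage example, `check_weights(2.0, 3.0)` returns an empty list, while a pair such as `(1.0, 10.0)` stays within both ranges but is flagged for the large gap.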
+## Requirements
+- Windows 10/11 or Linux
+- Python 3.10
+- 8 GB+ RAM; an NVIDIA GPU is recommended
+- 5 GB+ free storage space
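Two of the requirements above, the Python version and free disk space, can be checked with the standard library alone. The sketch below is illustrative: `check_environment` is a hypothetical helper, it does not check RAM or GPU availability, and it assumes the 5 GB figure refers to the drive containing the given path.

```python
import shutil
import sys


def check_environment(path: str = ".", min_free_gb: float = 5.0) -> dict:
    """Report whether the interpreter and free disk space match the list above."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_3_10": sys.version_info[:2] == (3, 10),
        "free_disk_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```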
+## Credits
+- **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3)
+- **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) (Apache 2.0)
+- **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning)
+- **Windows Implementation**: [Saganaki22](https://github.com/Saganaki22)
+- **Special Thanks**: MysteryShack on Discord
+## Citation
+```bibtex
+@article{jiang2025sparse,
+  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
+  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and others},
+  journal={arXiv preprint arXiv:2502.18924},
+  year={2025}
+}
+```
+---
+*High-quality voice cloning for research and creative applications. Please use responsibly.*