|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- text-to-speech |
|
|
- tts |
|
|
- voice-cloning |
|
|
- speech-synthesis |
|
|
- pytorch |
|
|
- audio |
|
|
- chinese |
|
|
- english |
|
|
- zero-shot |
|
|
- diffusion |
|
|
library_name: transformers |
|
|
pipeline_tag: text-to-speech |
|
|
--- |
|
|
|
|
|
# MegaTTS3-WaveVAE: Complete Voice Cloning Model |
|
|
|
|
|
<div align="center"> |
|
|
<h3><a href="https://github.com/Saganaki22/MegaTTS3-WaveVAE">GitHub Repository</a></h3>
|
|
|
|
|
<img src="https://img.shields.io/github/stars/Saganaki22/MegaTTS3-WaveVAE?style=social" alt="GitHub Stars"> |
|
|
<img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License"> |
|
|
<img src="https://img.shields.io/badge/Platform-Windows-blue" alt="Platform"> |
|
|
<img src="https://img.shields.io/badge/Language-Chinese%20%7C%20English-red" alt="Language"> |
|
|
</div> |
|
|
|
|
|
## About |
|
|
|
|
|
This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, this includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples. |
|
|
|
|
|
**Key Features:** |
|
|
- **Zero-shot voice cloning** from any 3-24 second audio sample


- **Bilingual**: Chinese, English, and code-switching


- **Efficient**: 0.45B-parameter diffusion transformer


- **Complete**: includes the WaveVAE encoder/decoder missing from the original release


- **Controllable**: adjustable voice similarity and clarity


- **Windows ready**: one-click installer available
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Installation |
|
|
**[One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - Automated setup with GPU detection
|
|
|
|
|
Or see [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) for advanced users. |
|
|
|
|
|
### Usage Examples |
|
|
|
|
|
```bash |
|
|
# Basic voice cloning |
|
|
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output |
|
|
|
|
|
# Better quality settings |
|
|
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0 |
|
|
|
|
|
# Web interface (easiest) |
|
|
python tts/megatts3_gradio.py |
|
|
# Then open http://localhost:7929 |
|
|
``` |
|
|
|
|
|
## Model Components |
|
|
|
|
|
- **Diffusion Transformer**: 0.45B parameter TTS model |
|
|
- **WaveVAE**: High-quality audio encoder/decoder |
|
|
- **Aligner**: Speech-text alignment model |
|
|
- **G2P**: Grapheme-to-phoneme converter |
|
|
|
|
|
## Parameters |
|
|
- `--p_w` (Intelligibility): 1.0-5.0, higher = clearer speech |
|
|
- `--t_w` (Similarity): 0.0-10.0, higher = more similar to reference |
|
|
- **Tip**: Set `t_w` 0-3 points higher than `p_w`
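
As a sketch of how these flags compose, the loop below prints one cloning command per reference clip, keeping `t_w` two points above `p_w` per the tip. The `refs/` directory and per-clip output layout are illustrative assumptions; the CLI flags match the usage examples earlier in this card.

```shell
# Dry-run sketch: print one voice-cloning command per reference clip.
# refs/ and the output layout are hypothetical; remove the `echo`
# to actually run each command.
P_W=2.0          # intelligibility weight
T_W=4.0          # similarity weight, kept 2 points above p_w

for ref in refs/*.wav; do
  [ -e "$ref" ] || continue          # skip if no .wav files match
  name=$(basename "$ref" .wav)
  echo python tts/infer_cli.py \
    --input_wav "$ref" \
    --input_text "Your text here" \
    --output_dir "./output/$name" \
    --p_w "$P_W" --t_w "$T_W"
done
```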
|
|
|
|
|
## Requirements |
|
|
- Windows 10/11 or Linux |
|
|
- Python 3.10 |
|
|
- 8GB+ RAM, NVIDIA GPU recommended |
|
|
- 5GB+ storage space |
|
|
|
|
|
## Credits |
|
|
|
|
|
- **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3) |
|
|
- **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0] |
|
|
- **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning) |
|
|
- **Windows Implementation & Complete Package**: [Saganaki22/MegaTTS3-WaveVAE](https://github.com/Saganaki22/MegaTTS3-WaveVAE) |
|
|
- **Special Thanks**: MysteryShack on Discord for model information |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite the original research: |
|
|
|
|
|
```bibtex |
|
|
@article{jiang2025sparse, |
|
|
title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis}, |
|
|
author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others}, |
|
|
journal={arXiv preprint arXiv:2502.18924}, |
|
|
year={2025} |
|
|
} |
|
|
|
|
|
@article{ji2024wavtokenizer, |
|
|
title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling}, |
|
|
author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others}, |
|
|
journal={arXiv preprint arXiv:2408.16532}, |
|
|
year={2024} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
*High-quality voice cloning for research and creative applications. Please use responsibly.* |