|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- text-to-speech |
|
|
- tts |
|
|
- voice-cloning |
|
|
- speech-synthesis |
|
|
- pytorch |
|
|
- audio |
|
|
- chinese |
|
|
- english |
|
|
- zero-shot |
|
|
- diffusion |
|
|
library_name: transformers |
|
|
pipeline_tag: text-to-speech |
|
|
--- |
|
|
|
|
|
# MegaTTS3-WaveVAE: Complete Voice Cloning Model |
|
|
|
|
|
<div align="center"> |
|
|
<h3><a href="https://github.com/Saganaki22/MegaTTS3-WaveVAE">GitHub Repository</a></h3>
|
|
|
|
|
<img src="https://img.shields.io/github/stars/Saganaki22/MegaTTS3-WaveVAE?style=social" alt="GitHub Stars"> |
|
|
<img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License"> |
|
|
<img src="https://img.shields.io/badge/Platform-Windows-blue" alt="Platform"> |
|
|
<img src="https://img.shields.io/badge/Language-Chinese%20%7C%20English-red" alt="Language"> |
|
|
</div> |
|
|
|
|
|
## About |
|
|
|
|
|
This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, this includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples. |
|
|
|
|
|
**Key Features:** |
|
|
- **Zero-shot voice cloning** from any 3-24 second audio sample


- **Bilingual**: Chinese, English, and code-switching


- **Efficient**: 0.45B-parameter diffusion transformer


- **Complete**: includes the WaveVAE encoder/decoder missing from the original release


- **Controllable**: adjustable voice similarity and clarity


- **Windows ready**: one-click installer available
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Installation |
|
|
**[One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - Automated setup with GPU detection
|
|
|
|
|
Or see [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) for advanced users. |
|
|
|
|
|
### Usage Examples |
|
|
|
|
|
```bash |
|
|
# Basic voice cloning |
|
|
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output |
|
|
|
|
|
# Better quality settings |
|
|
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0 |
|
|
|
|
|
# Web interface (easiest) |
|
|
python tts/megatts3_gradio.py |
|
|
# Then open http://localhost:7929 |
|
|
``` |
|
|
|
|
|
## Model Components |
|
|
|
|
|
- **Diffusion Transformer**: 0.45B parameter TTS model |
|
|
- **WaveVAE**: High-quality audio encoder/decoder |
|
|
- **Aligner**: Speech-text alignment model |
|
|
- **G2P**: Grapheme-to-phoneme converter |
|
|
|
|
|
## Parameters |
|
|
- `--p_w` (Intelligibility): 1.0-5.0, higher = clearer speech |
|
|
- `--t_w` (Similarity): 0.0-10.0, higher = more similar to reference |
|
|
- **Tip**: Set `t_w` 0-3 points higher than `p_w`
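
As a sketch of how these flags compose, the loop below prints one cloning command per reference clip, keeping `t_w` two points above `p_w` per the tip. The `refs/` directory and per-clip output layout are illustrative assumptions; the CLI flags match the usage examples earlier in this card.

```shell
# Dry-run sketch: print one voice-cloning command per reference clip.
# refs/ and the output layout are hypothetical; remove the `echo`
# to actually run each command.
P_W=2.0          # intelligibility weight
T_W=4.0          # similarity weight, kept 2 points above p_w

for ref in refs/*.wav; do
  [ -e "$ref" ] || continue          # skip if no .wav files match
  name=$(basename "$ref" .wav)
  echo python tts/infer_cli.py \
    --input_wav "$ref" \
    --input_text "Your text here" \
    --output_dir "./output/$name" \
    --p_w "$P_W" --t_w "$T_W"
done
```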
|
|
|
|
|
## Requirements |
|
|
- Windows 10/11 or Linux |
|
|
- Python 3.10 |
|
|
- 8GB+ RAM, NVIDIA GPU recommended |
|
|
- 5GB+ storage space |
|
|
|
|
|
## Credits |
|
|
|
|
|
- **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3) |
|
|
- **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0] |
|
|
- **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning) |
|
|
- **Windows Implementation & Complete Package**: [Saganaki22/MegaTTS3-WaveVAE](https://github.com/Saganaki22/MegaTTS3-WaveVAE) |
|
|
- **Special Thanks**: MysteryShack on Discord for model information |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite the original research: |
|
|
|
|
|
```bibtex |
|
|
@article{jiang2025sparse, |
|
|
title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis}, |
|
|
author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others}, |
|
|
journal={arXiv preprint arXiv:2502.18924}, |
|
|
year={2025} |
|
|
} |
|
|
|
|
|
@article{ji2024wavtokenizer, |
|
|
title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling}, |
|
|
author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others}, |
|
|
journal={arXiv preprint arXiv:2408.16532}, |
|
|
year={2024} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
*High-quality voice cloning for research and creative applications. Please use responsibly.* |