drbaph committed on
Commit a5c23c1 · verified · 1 Parent(s): b81dad9

Update README.md

Files changed (1): README.md +100 -3
README.md CHANGED
@@ -1,3 +1,100 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - text-to-speech
+ - tts
+ - voice-cloning
+ - speech-synthesis
+ - pytorch
+ - audio
+ - chinese
+ - english
+ - zero-shot
+ - diffusion
+ library_name: transformers
+ pipeline_tag: text-to-speech
+ ---
+
+ # MegaTTS3-WaveVAE: Complete Voice Cloning Model
+
+ <div align="center">
+ <h3>🚀 <a href="https://github.com/Saganaki22/MegaTTS3-WaveVAE">GitHub Repository</a></h3>
+
+ <img src="https://img.shields.io/github/stars/Saganaki22/MegaTTS3-WaveVAE?style=social" alt="GitHub Stars">
+ <img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License">
+ <img src="https://img.shields.io/badge/Platform-Windows-blue" alt="Platform">
+ <img src="https://img.shields.io/badge/Language-Chinese%20%7C%20English-red" alt="Language">
+ </div>
+
+ ## About
+
+ This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, this includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.
+
+ **Key Features:**
+ - 🎯 Zero-shot voice cloning from any 3-24 second audio sample
+ - 🌍 Bilingual: Chinese, English, and code-switching
+ - ⚡ Efficient: 0.45B parameter diffusion transformer
+ - 🔧 Complete: Includes WaveVAE (missing from original)
+ - 🎛️ Controllable: Adjustable voice similarity and clarity
+ - 💻 Windows ready: One-click installer available
+
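The feature list above asks for a reference clip of 3 to 24 seconds. As a minimal sketch, assuming the reference is an uncompressed PCM `.wav` file, the length could be checked before running inference using only Python's standard `wave` module. The function names `wav_duration_seconds` and `is_usable_reference` are hypothetical helpers for illustration and are not part of MegaTTS3.

```python
import wave

# Bounds taken from the feature list ("3-24 second audio sample").
MIN_SECONDS = 3.0
MAX_SECONDS = 24.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames / frames per second.
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path: str) -> bool:
    """True when the clip length falls inside the 3-24 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

A call such as `is_usable_reference("reference.wav")` returns a boolean that a script could test before invoking the CLI. Whether the tool itself rejects clips outside this window is not stated in the README, so this check is only a convenience.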
+ ## Quick Start
+
+ ### Installation
+ **[📥 One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - Automated setup with GPU detection
+
+ Or see [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) for advanced users.
+
+ ### Usage Examples
+
+ ```bash
+ # Basic voice cloning
+ python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output
+
+ # Better quality settings
+ python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0
+
+ # Web interface (easiest)
+ python tts/megatts3_gradio.py
+ # Then open http://localhost:7929
+ ```
+
+ ## Model Components
+
+ - **Diffusion Transformer**: 0.45B parameter TTS model
+ - **WaveVAE**: High-quality audio encoder/decoder
+ - **Aligner**: Speech-text alignment model
+ - **G2P**: Grapheme-to-phoneme converter
+
+ ## Parameters
+ - `--p_w` (Intelligibility): 1.0-5.0, higher = clearer speech
+ - `--t_w` (Similarity): 0.0-10.0, higher = more similar to reference
+ - **Tip**: Set t_w 0-3 points higher than p_w
+
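The ranges and the tip above can be sketched as a small Python helper that assembles the `tts/infer_cli.py` command line. This is only an illustration: the flags and ranges come from this README, while the function names `build_infer_command` and `follows_tip` are hypothetical, and it is an assumption that values outside the listed ranges should be rejected up front rather than passed through.

```python
import shlex


def build_infer_command(input_wav, input_text, output_dir, p_w=2.0, t_w=3.0):
    """Build the argument list for tts/infer_cli.py after checking the weights."""
    # Ranges as listed in the Parameters section of the README.
    if not 1.0 <= p_w <= 5.0:
        raise ValueError(f"p_w must be within 1.0-5.0, got {p_w}")
    if not 0.0 <= t_w <= 10.0:
        raise ValueError(f"t_w must be within 0.0-10.0, got {t_w}")
    return [
        "python", "tts/infer_cli.py",
        "--input_wav", input_wav,
        "--input_text", input_text,
        "--output_dir", output_dir,
        "--p_w", str(p_w),
        "--t_w", str(t_w),
    ]


def follows_tip(p_w, t_w):
    """True when t_w is 0 to 3 points higher than p_w, as the tip suggests."""
    return 0.0 <= t_w - p_w <= 3.0


if __name__ == "__main__":
    cmd = build_infer_command("reference.wav", "Your text here", "./output")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` from the repository root. Keeping the tip as a separate check (`follows_tip`) rather than an error leaves room to experiment with weights outside the suggested gap.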
+ ## Requirements
+ - Windows 10/11 or Linux
+ - Python 3.10
+ - 8GB+ RAM, NVIDIA GPU recommended
+ - 5GB+ storage space
+
+ ## Credits
+
+ - **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3)
+ - **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0]
+ - **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning)
+ - **Windows Implementation**: [Saganaki22](https://github.com/Saganaki22)
+ - **Special Thanks**: MysteryShack on Discord
+
+ ## Citation
+
+ ```bibtex
+ @article{jiang2025sparse,
+ title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
+ author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and others},
+ journal={arXiv preprint arXiv:2502.18924},
+ year={2025}
+ }
+ ```
+
+ ---
+ *High-quality voice cloning for research and creative applications. Please use responsibly.*