language:
- en
- zh
tags:
- text-to-speech
---
# ZipVoice

## Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

[![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](http://arxiv.org/abs/2506.13053)
[![demo](https://img.shields.io/badge/GitHub-Demo%20page-orange.svg)](https://zipvoice.github.io/)
## Overview

ZipVoice is a high-quality zero-shot TTS model with a small model size and fast inference speed.

### 1. Key features

- Small and fast: only 123M parameters.

- High-quality: state-of-the-art voice cloning performance in speaker similarity, intelligibility, and naturalness.

- Multilingual: supports Chinese and English.
### 2. Architecture

<div align="center">

<img src="https://zipvoice.github.io/pics/zipvoice.png" width="500">

</div>

## News

**2025/06/16**: 🔥 ZipVoice is released.
## Installation

### 1. Clone the ZipVoice repository

```bash
git clone https://github.com/k2-fsa/ZipVoice.git
```

### 2. (Optional) Create a Python virtual environment

```bash
python3 -m venv zipvoice
source zipvoice/bin/activate
```

### 3. Install the required packages

```bash
pip install -r requirements.txt
```
### 4. (Optional) Install k2 for training or efficient inference

k2 is required for training and can speed up inference. However, you can still run ZipVoice inference without installing k2.

> **Note:** Make sure to install the k2 version that matches your PyTorch and CUDA versions. For example, if you are using PyTorch 2.5.1 and CUDA 12.1, you can install k2 as follows:

```bash
pip install k2==1.24.4.dev20250208+cuda12.1.torch2.5.1 -f https://k2-fsa.github.io/k2/cuda.html
```

Please refer to https://k2-fsa.org/get-started/k2/ for details.
Users in mainland China can refer to https://k2-fsa.org/zh-CN/get-started/k2/.
## Usage

To generate speech with our pre-trained ZipVoice or ZipVoice-Distill models, use the following commands (the required models will be downloaded from HuggingFace automatically):

### 1. Inference of a single sentence

```bash
python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice" \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result.wav
```

- `--model-name` can be `zipvoice` or `zipvoice_distill`, which are the models before and after distillation, respectively (see the example below).
- If `<>` or `[]` appear in the text, strings enclosed by them will be treated as special tokens: `<>` denotes Chinese pinyin and `[]` denotes other special tags.
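
To try the distilled (faster) model, the same command applies with only the model name changed. A minimal sketch reusing the flags and file names from the example above (the output path `result_distill.wav` is just an illustrative name):

```bash
# Same command as the single-sentence example, but with the distilled model;
# only --model-name and the output file name differ.
python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice_distill" \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result_distill.wav
```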

### 2. Inference of a list of sentences

```bash
python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice" \
    --test-list test.tsv \
    --res-dir results/test
```

- Each line of `test.tsv` is in the format of `{wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}`.
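
For reference, such a file could be created like this (the utterance names and texts below are only placeholders; the fields must be separated by real tab characters):

```bash
# Write a hypothetical two-line test.tsv: wav_name, prompt transcription,
# prompt wav path, and target text, separated by tabs.
printf 'utt_0001\tI am the transcription of the prompt wav.\tprompt.wav\tThis is the first sentence to synthesize.\n' > test.tsv
printf 'utt_0002\tI am the transcription of the prompt wav.\tprompt.wav\tThis is the second sentence to synthesize.\n' >> test.tsv
```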

> **Note:** If you have trouble connecting to HuggingFace, try:
> ```bash
> export HF_ENDPOINT=https://hf-mirror.com
> ```

### 3. Correcting mispronounced Chinese polyphonic characters

We use [pypinyin](https://github.com/mozillazg/python-pinyin) to convert Chinese characters to pinyin. However, it can occasionally mispronounce **polyphonic characters** (多音字).

To correct these mispronunciations manually, enclose the **corrected pinyin** in angle brackets `< >` and include the **tone mark**.

**Example:**

- Original text: `这把剑长三十公分`
- Correct the pinyin of `长`: `这把剑<chang2>三十公分`

> **Note:** If you want to assign pinyin to multiple characters manually, enclose each pinyin in its own `<>`, e.g., `这把<jian4><chang2><san1>十公分`.
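
The corrected text is passed to the inference script like any other input. A minimal sketch based on the single-sentence command above, assuming the English prompt files from that example and an illustrative output name `result_zh.wav`:

```bash
# Hypothetical example: synthesize the pinyin-corrected sentence;
# <chang2> pins the pronunciation of 长.
python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice" \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "这把剑<chang2>三十公分" \
    --res-wav-path result_zh.wav
```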

## Discussion & Communication

You can discuss directly on [GitHub Issues](https://github.com/k2-fsa/ZipVoice/issues).

You can also scan the QR codes below to join our WeChat group or follow our WeChat official account.

| WeChat Group | WeChat Official Account |
| ------------ | ----------------------- |
| ![wechat](https://k2-fsa.org/zh-CN/assets/pic/wechat_group.jpg) | ![wechat](https://k2-fsa.org/zh-CN/assets/pic/wechat_account.jpg) |

## Citation

```bibtex
@article{zhu2025zipvoice,
  title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
  author={Zhu, Han and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Li, Zhaoqing and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2506.13053},
  year={2025}
}
```