---
datasets:
- k2-fsa/OpenDialog
- amphion/Emilia-Dataset
- k2-fsa/TTS_eval_datasets
language:
- en
- zh
license: apache-2.0
pipeline_tag: text-to-speech
tags:
- text-to-speech
---

# ZipVoice⚡: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

This repository provides checkpoints for two fast, high-quality, non-autoregressive zero-shot text-to-speech models:

- **ZipVoice**, for single-speaker speech generation. See the [paper](https://arxiv.org/abs/2506.13053) and [demo](https://zipvoice.github.io).
- **ZipVoice-Dialog**, for spoken dialogue generation. See the [paper](https://arxiv.org/abs/2507.09318) and [demo](https://zipvoice-dialog.github.io).

See our GitHub repository [ZipVoice](https://github.com/k2-fsa/ZipVoice) for instructions on using these models.


## 1. Explanation of each directory

| Directory                      | Model Type                | Training Data                     | Initialized from           |
| :---------------------------- | :-----------------------: | :-------------------------------: | :------------------------: |
| zipvoice                       | ZipVoice                  | Emilia                            | -                          |
| zipvoice_libritts              | ZipVoice                  | LibriTTS                          | -                          |
| zipvoice_distill               | ZipVoice-Distill          | Emilia                            | zipvoice/model.pt          |
| zipvoice_distill_libritts      | ZipVoice-Distill          | LibriTTS                          | zipvoice_libritts/model.pt |
| zipvoice_dialog                | ZipVoice-Dialog           | OpenDialog + in-house dataset     | zipvoice/model.pt          |
| zipvoice_dialog_opendialog     | ZipVoice-Dialog           | OpenDialog                        | zipvoice/model.pt          |
| zipvoice_dialog_stereo         | ZipVoice-Dialog-Stereo    | in-house dataset                  | zipvoice_dialog/model.pt   |
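
The "Initialized from" column forms a small dependency chain between checkpoints. The following is a minimal sketch, not part of the official ZipVoice tooling, that encodes the table as a dictionary and walks that chain. The names `INITIALIZED_FROM`, `lineage`, and `download_directory` are hypothetical; `download_directory` assumes the `huggingface_hub` package is installed and that you pass this repository's id yourself.

```python
# Initialization chain copied from the directory table above.
# None means the table lists no parent checkpoint ("-").
INITIALIZED_FROM = {
    "zipvoice": None,
    "zipvoice_libritts": None,
    "zipvoice_distill": "zipvoice",
    "zipvoice_distill_libritts": "zipvoice_libritts",
    "zipvoice_dialog": "zipvoice",
    "zipvoice_dialog_opendialog": "zipvoice",
    "zipvoice_dialog_stereo": "zipvoice_dialog",
}


def lineage(directory):
    """Return the directory followed by every checkpoint it descends from."""
    chain = []
    current = directory
    while current is not None:
        chain.append(current)
        current = INITIALIZED_FROM[current]  # KeyError for unknown names
    return chain


def download_directory(repo_id, directory):
    """Fetch a single directory of the repository (hypothetical helper).

    Assumes `pip install huggingface_hub`; the import is lazy so that
    lineage() works without the package.
    """
    from huggingface_hub import snapshot_download

    if directory not in INITIALIZED_FROM:
        raise KeyError(directory)
    # allow_patterns restricts the download to files under that directory.
    return snapshot_download(repo_id=repo_id, allow_patterns=[f"{directory}/*"])


if __name__ == "__main__":
    print(" <- ".join(lineage("zipvoice_dialog_stereo")))
    # prints: zipvoice_dialog_stereo <- zipvoice_dialog <- zipvoice
```

Only inference needs the single directory you download; the lineage is informational and shows which base model each checkpoint was fine-tuned or distilled from.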


## 2. Discussion & Communication

For questions and discussion, open a thread on [GitHub Issues](https://github.com/k2-fsa/ZipVoice/issues).

You can also scan the QR codes below to join our WeChat group or follow our WeChat official account.

| WeChat Group | WeChat Official Account |
| ------------ | ----------------------- |
|![WeChat group QR code](https://k2-fsa.org/zh-CN/assets/pic/wechat_group.jpg) |![WeChat official account QR code](https://k2-fsa.org/zh-CN/assets/pic/wechat_account.jpg) |

## 3. Citation

```bibtex
@article{zhu2025zipvoice,
      title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
      author={Zhu, Han and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Li, Zhaoqing and Zhuang, Weiji and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2506.13053},
      year={2025}
}

@article{zhu2025zipvoicedialog,
      title={ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching},
      author={Zhu, Han and Kang, Wei and Guo, Liyong and Yao, Zengwei and Kuang, Fangjun and Zhuang, Weiji and Li, Zhaoqing and Han, Zhifeng and Zhang, Dong and Zhang, Xin and Song, Xingchen and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2507.09318},
      year={2025}
}
```