jiaqili3 committed
Commit b5d3158 · verified · 1 Parent(s): 73faed6

Update README.md

Files changed (1):
  1. README.md +160 -34

README.md CHANGED
@@ -1,6 +1,23 @@
 # DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
 
 ## About
 
 ## Installation
 ```bash
@@ -8,8 +25,10 @@ pip install dualcodec
 ```
 
 ## News
- - 2025-01-22: I added training and finetuning instructions for DualCodec, version is v0.3.0.
- - 2025-01-16: Finished writing DualCodec inference codes, the version is v0.1.0.
 
 ## Available models
 <!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
@@ -22,8 +41,37 @@ pip install dualcodec
 
 
 ## How to inference DualCodec
 
- ### 1. Download checkpoints to local:
 ```
 # export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
 huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
@@ -31,7 +79,7 @@ huggingface-cli download amphion/dualcodec dualcodec_12hz_16384_4096.safetensors
 ```
 The second command downloads the two DualCodec model checkpoints (12hz_v1 and 25hz_v1) and the w2v-bert-2 mean and variance statistics to the local directory `dualcodec_ckpts`.
 
- ### 2. To inference an audio in a python script:
 ```python
 import dualcodec
 
@@ -40,7 +88,7 @@ dualcodec_model_path = "./dualcodec_ckpts" # your downloaded path
 model_id = "12hz_v1" # select from available Model_IDs, "12hz_v1" or "25hz_v1"
 
 dualcodec_model = dualcodec.get_model(model_id, dualcodec_model_path)
- inference = dualcodec.Inference(dualcodec_model=dualcodec_model, dualcodec_path=dualcodec_model_path, w2v_path=w2v_path, device="cuda")
 
 # do inference for your wav
 import torchaudio
@@ -48,13 +96,14 @@ audio, sr = torchaudio.load("YOUR_WAV.wav")
 # resample to 24kHz
 audio = torchaudio.functional.resample(audio, sr, 24000)
 audio = audio.reshape(1,1,-1)
 # extract codes, for example, using 8 quantizers here:
- semantic_codes, acoustic_codes = inference.encode(audio, n_quantizers=8)
 # semantic_codes shape: torch.Size([1, 1, T])
 # acoustic_codes shape: torch.Size([1, n_quantizers-1, T])
 
- # produce output audio
- out_audio = dualcodec_model.decode_from_codes(semantic_codes, acoustic_codes)
 
 # save output audio
 torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
@@ -62,64 +111,141 @@ torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
 
 See "example.ipynb" for a running example.
 
- ## DualCodec-based TTS models
- ### DualCodec-based TTS
 
- ## Benchmark results
- ### DualCodec audio quality
- ### DualCodec-based TTS
 
- ## Finetuning DualCodec
- 1. Install other necessary components for training:
 ```bash
- pip install "dualcodec[train]"
 ```
- 2. Clone this repository and `cd` to project root folder.
 
- 3. Get discriminator checkpoints:
 ```bash
- huggingface-cli download amphion/dualcodec --local-dir dualcodec_ckpts
 ```
 
- 4. To run example training on Emilia German data (streaming, no need to download files. Need to access Huggingface):
 ```bash
- accelerate launch train.py --config-name=dualcodec_ft_12hzv1 \
- trainer.batch_size=3 \
- data.segment_speech.segment_length=24000
 ```
- This trains from scratch a 12hz_v1 model with a training batch size of 3. (typically you need larger batch sizes)
 
- To finetune a 25Hz_V1 model:
 ```bash
- accelerate launch train.py --config-name=dualcodec_ft_25hzv1 \
- trainer.batch_size=3 \
- data.segment_speech.segment_length=24000
 ```
 
 
 ## Training DualCodec from scratch
 1. Install other necessary components for training:
 ```bash
- pip install dualcodec[train]
 ```
- 2. Clone this repository and `cd` to project root folder.
 
 3. To run example training on the example Emilia German data:
 ```bash
- accelerate launch train.py --config-name=codec_train \
 model=dualcodec_12hz_16384_4096_8vq \
 trainer.batch_size=3 \
 data.segment_speech.segment_length=24000
 ```
- This trains from scratch a dualcodec_12hz_16384_4096_8vq model with a training batch size of 3. (typically you need larger batch sizes)
 
- To train a 25Hz model:
 ```bash
- accelerate launch train.py --config-name=codec_train \
 model=dualcodec_25hz_16384_1024_12vq \
 trainer.batch_size=3 \
 data.segment_speech.segment_length=24000
 
 ```
 
 ## Citation
+
+
 # DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
 
+ [![arXiv](https://img.shields.io/badge/arXiv-2505.13000-brightgreen.svg?style=flat-square)](http://arxiv.org/abs/2505.13000)
+ [![githubio](https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=Github&style=flat-square)](https://dualcodec.github.io/)
+ [![PyPI](https://img.shields.io/pypi/v/dualcodec?color=blue&label=PyPI&logo=PyPI&style=flat-square)](https://pypi.org/project/dualcodec/)
+ [![GitHub](https://img.shields.io/badge/Github-Dev_Release-pink?logo=Github&style=flat-square)](https://github.com/jiaqili3/dualcodec)
+ [![Amphion](https://img.shields.io/badge/Amphion-Stable_Release-blue?style=flat-square)](https://github.com/open-mmlab/Amphion/blob/main/models/codec/dualcodec/README.md)
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E)
+
 ## About
+ DualCodec is a low-frame-rate (12.5Hz or 25Hz), semantically-enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation.
+
+ You can check out its [demo page](https://dualcodec.github.io/).
+ An overview of the DualCodec system is shown in the figure below:
+
+ <!-- show dualcodec.png -->
+ ![DualCodec](dualcodec.png)
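+ As a rough sense of scale: at the 12.5Hz frame rate with 8 quantizers (one semantic stream plus seven acoustic streams, as in the inference example below), the codec emits 12.5 × 8 = 100 discrete tokens per second of speech.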
+
 
 ## Installation
 ```bash
 pip install dualcodec
 ```
 
 ## News
+ - 2025-05-19: DualCodec was accepted to Interspeech 2025!
+ - 2025-03-30: Added automatic checkpoint downloading from Hugging Face. Uploaded some TTS models (DualCodec-VALLE, DualCodec-Voicebox).
+ - 2025-01-22: Added training and finetuning instructions for DualCodec, as well as a Gradio interface (v0.3.0).
+ - 2025-01-16: Finished writing the DualCodec inference code (v0.1.0). The latest versions are synced to PyPI.
 
 ## Available models
 <!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
 
 
 
 ## How to inference DualCodec
+ ### 1. Programmatic usage (automatically downloads checkpoints from Hugging Face):
+ ```python
+ import dualcodec
+
+ model_id = "12hz_v1" # select from available Model_IDs, "12hz_v1" or "25hz_v1"
+
+ dualcodec_model = dualcodec.get_model(model_id)
+ dualcodec_inference = dualcodec.Inference(dualcodec_model=dualcodec_model, device="cuda")
+
+ # do inference for your wav
+ import torchaudio
+ audio, sr = torchaudio.load("YOUR_WAV.wav")
+ # resample to 24kHz
+ audio = torchaudio.functional.resample(audio, sr, 24000)
+ audio = audio.reshape(1,1,-1)
+ audio = audio.to("cuda")
+ # extract codes, for example, using 8 quantizers here:
+ semantic_codes, acoustic_codes = dualcodec_inference.encode(audio, n_quantizers=8)
+ # semantic_codes shape: torch.Size([B, 1, T])
+ # acoustic_codes shape: torch.Size([B, n_quantizers-1, T])
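+ # (the semantic stream is the first RVQ layer; acoustic_codes carry the remaining n_quantizers-1 layers)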
+
+ # produce output audio
+ out_audio = dualcodec_inference.decode(semantic_codes, acoustic_codes)
 
+ # save output audio
+ torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
+ ```
+
+
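+ A minimal follow-up sketch, assuming the `dualcodec_inference` and `semantic_codes` objects from the example above: per the note in the local-checkpoint example below, passing `acoustic_codes=None` decodes from the semantic (RVQ-1) stream alone, trading reconstruction quality for a lower token count.
+ ```python
+ # Semantic-only reconstruction sketch: decode using just the RVQ-1 codes.
+ rvq1_audio = dualcodec_inference.decode(semantic_codes, acoustic_codes=None)
+ torchaudio.save("out_rvq1.wav", rvq1_audio.cpu().squeeze(0), 24000)
+ ```
+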
+ ### 2. Alternative usage with local checkpoints
+ First, download the checkpoints locally:
 ```
 # export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
 huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
 
 ```
 The second command downloads the two DualCodec model checkpoints (12hz_v1 and 25hz_v1) and the w2v-bert-2 mean and variance statistics to the local directory `dualcodec_ckpts`.
 
+ Then you can use the following code to run DualCodec inference with local checkpoints.
 ```python
 import dualcodec
 
 
 dualcodec_model_path = "./dualcodec_ckpts" # your downloaded path
 model_id = "12hz_v1" # select from available Model_IDs, "12hz_v1" or "25hz_v1"
 
 dualcodec_model = dualcodec.get_model(model_id, dualcodec_model_path)
+ dualcodec_inference = dualcodec.Inference(dualcodec_model=dualcodec_model, dualcodec_path=dualcodec_model_path, w2v_path=w2v_path, device="cuda")
 
 # do inference for your wav
 import torchaudio
 audio, sr = torchaudio.load("YOUR_WAV.wav")
 # resample to 24kHz
 audio = torchaudio.functional.resample(audio, sr, 24000)
 audio = audio.reshape(1,1,-1)
+ audio = audio.to("cuda")
 # extract codes, for example, using 8 quantizers here:
+ semantic_codes, acoustic_codes = dualcodec_inference.encode(audio, n_quantizers=8)
 # semantic_codes shape: torch.Size([1, 1, T])
 # acoustic_codes shape: torch.Size([1, n_quantizers-1, T])
 
+ # produce output audio. If `acoustic_codes=None` is passed, it will decode only the semantic codes (RVQ-1)
+ out_audio = dualcodec_inference.decode(semantic_codes, acoustic_codes)
 
 # save output audio
 torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
 ```
 
 See "example.ipynb" for a running example.
 
+ ### 3. Google Colab
+ The notebook provides a demo of reconstructing audio with different numbers of RVQ layers:
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E)
+
+ ### 4. Gradio interface
+ To use the Gradio interface, run:
+ ```bash
+ python -m dualcodec.app
+ ```
+ This launches an app where you can upload a wav file and get the reconstructed output wav file.
 
+ ## DualCodec-based TTS models
+ Models available:
+ - DualCodec-VALLE: a super fast 12.5Hz VALL-E TTS model based on DualCodec.
+ - DualCodec-Voicebox: a flow-matching decoder for DualCodec 12.5Hz semantic codes. It can be used as the second stage of a TTS pipeline, but this component alone is not a TTS model.
+
+ To continue, first install the additional TTS components:
 ```bash
+ pip install "dualcodec[tts]"
+ ```
+ Alternatively, to install from source:
+ ```bash
+ pip install -e .[tts]
 ```
 
+ ### DualCodec-VALLE
+ DualCodec-VALLE is a TTS model based on DualCodec. It uses the 12.5Hz frame rate with 8 quantizers and is trained on 100K hours of Emilia data.
+ #### CLI Inference
 ```bash
+ python -m dualcodec.infer.valle.cli_valle_infer --ref_audio <path_to_ref_audio> --ref_text "TEXT OF YOUR REF AUDIO" --gen_text "This is the generated text" --output_dir test --output_file test.wav
 ```
+ You can also leave all options empty to use the default values.
 
+ #### Gradio interface
 ```bash
+ python -m dualcodec.infer.valle.gradio_valle_demo
 ```
 
+ ### DualCodec-Voicebox
+ #### CLI Inference
 ```bash
+ python -m dualcodec.infer.voicebox.cli_voicebox_infer --ref_audio <path_to_ref_audio> --output_dir test --output_file test.wav
+ ```
+ You can also leave all options empty to use the default values.
+
+ ### FAQ
+ If you run into environment problems at this stage, try:
+ ```bash
+ pip install -U wandb protobuf transformers
 ```
 
 
 ## Training DualCodec from scratch
 1. Install other necessary components for training:
 ```bash
+ pip install "dualcodec[tts]"
+ ```
+ 2. Clone this repository and `cd` to the project root folder (the folder that contains this README):
+ ```bash
+ git clone https://github.com/jiaqili3/DualCodec.git
+ cd DualCodec
 ```
 
 3. To run example training on the example Emilia German data:
 ```bash
+ accelerate launch train.py --config-name=dualcodec_train \
 model=dualcodec_12hz_16384_4096_8vq \
 trainer.batch_size=3 \
 data.segment_speech.segment_length=24000
 ```
+ This trains a 12hz_v1 model from scratch with a training batch size of 3 (you typically need a larger batch size, such as 10).
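+ Here `data.segment_speech.segment_length=24000` corresponds to 1-second training segments at the codec's 24kHz sampling rate.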
 
+ To train a 25hz_v1 model:
 ```bash
+ accelerate launch train.py --config-name=dualcodec_train \
 model=dualcodec_25hz_16384_1024_12vq \
 trainer.batch_size=3 \
 data.segment_speech.segment_length=24000
 ```
 
+
+ ## Finetuning DualCodec
+ 1. Install other necessary components for training:
+ ```bash
+ pip install "dualcodec[train]"
+ ```
+ 2. Clone this repository and `cd` to the project root folder (the folder that contains this README).
+
+ 3. Get the discriminator checkpoints:
+ ```bash
+ huggingface-cli download amphion/dualcodec --local-dir dualcodec_ckpts
+ ```
+
+ 4. To run example finetuning on the Emilia German data (streaming, so there is no need to download files, but network access to Hugging Face is required):
+ ```bash
+ accelerate launch train.py --config-name=dualcodec_ft_12hzv1 \
+ trainer.batch_size=3 \
+ data.segment_speech.segment_length=24000
+ ```
+ This finetunes a 12hz_v1 model with a training batch size of 3 (you typically need a larger batch size, such as 10).
+
+ To finetune a 25hz_v1 model:
+ ```bash
+ accelerate launch train.py --config-name=dualcodec_ft_25hzv1 \
+ trainer.batch_size=3 \
+ data.segment_speech.segment_length=24000
+ ```
+
 ## Citation
+ ```bibtex
+ @inproceedings{dualcodec,
+   title     = {DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation},
+   author    = {Li, Jiaqi and Lin, Xiaolong and Li, Zhekai and Huang, Shixi and Wang, Yuancheng and Wang, Chaoren and Zhan, Zhenpeng and Wu, Zhizheng},
+   booktitle = {Proceedings of Interspeech 2025},
+   year      = {2025}
+ }
+ ```
+ If you use this with the Amphion toolkit, please also consider citing:
+ ```bibtex
+ @article{amphion2,
+   title   = {Overview of the Amphion Toolkit (v0.2)},
+   author  = {Jiaqi Li and Xueyao Zhang and Yuancheng Wang and Haorui He and Chaoren Wang and Li Wang and Huan Liao and Junyi Ao and Zeyu Xie and Yiqiao Huang and Junan Zhang and Zhizheng Wu},
+   journal = {arXiv preprint arXiv:2501.15442},
+   year    = {2025}
+ }
+
+ @inproceedings{amphion,
+   title     = {Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
+   author    = {Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
+   booktitle = {{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
+   year      = {2024}
+ }
+ ```