harryjulian committed 6a1b3c7 (verified) · 1 parent: 207397c

Update README.md

Files changed (1): README.md (+19 -53)

README.md:
 
---
license: apache-2.0
tags:
- audio
- speech
- audio-to-audio
- speech-language-models
datasets:
- amphion/Emilia-Dataset
- facebook/multilingual_librispeech
- CSTR-Edinburgh/vctk
- google/fleurs
- mozilla-foundation/common_voice_13_0
---
# Model Details

NeuCodec is a Finite Scalar Quantisation (FSQ) based 0.8 kbps audio codec for speech tokenization. It takes advantage of the following features:

* It uses both audio (BigCodec) and semantic ([Wav2Vec2-BERT](https://huggingface.co/facebook/w2v-bert-2.0)) encoders.
* We make use of FSQ, resulting in a single vector per frame as the quantised output, which makes it ideal for downstream modeling with Speech Language Models.
* At 50 tokens/sec and 16 bits per token, the overall bit-rate is 0.8 kbps (a quick sketch of this arithmetic follows the list).
* The codec takes in 16 kHz input and outputs 24 kHz audio via an upsampling decoder.
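To make the quantiser and the bit-rate arithmetic concrete, below is a schematic PyTorch sketch of FSQ tokenisation. Only the effective codebook size is given in this card (65k entries, i.e. 16 bits per token); the 8-dimensions-by-4-levels factorisation and the tanh bounding below are hypothetical illustration choices (chosen because 4^8 = 65,536), not NeuCodec's actual configuration.

```python
import torch

# Hypothetical FSQ factorisation: 8 latent dims x 4 levels each = 4**8 = 65,536
# codes, matching the 16 bits/token above. NeuCodec's real levels and bounding
# function are not stated in this card; this sketches the technique only.
LEVELS = torch.full((8,), 4)

def fsq_tokenise(z: torch.Tensor) -> torch.Tensor:
    x = (torch.tanh(z) + 1) / 2                           # squash each dim into (0, 1)
    idx = torch.minimum((x * LEVELS).long(), LEVELS - 1)  # per-dim level index in [0, 3]
    radix = torch.cumprod(
        torch.cat([torch.ones(1, dtype=torch.long), LEVELS[:-1]]), dim=0
    )
    return (idx * radix).sum(-1)                          # pack all dims into one token id

z = torch.randn(50, 8)    # one second of encoder latents at 50 frames/s
ids = fsq_tokenise(z)     # 50 integer ids, each in [0, 65535]
print(ids.shape, int(ids.max()))
print(50 * 16, "bits/s")  # 50 tokens/s x 16 bits/token = 800 bps = 0.8 kbps
```

Because each frame packs into a single integer id, a downstream Speech Language Model sees one flat token stream rather than the parallel streams produced by multi-quantizer RVQ codecs.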
 
Our work is largely based on extending [X-Codec2.0](https://huggingface.co/HKUSTAudio/xcodec2).

- **Developed by:** Neuphonic
- **Model type:** Neural Audio Codec
- **License:** apache-2.0
- **Repository:** https://github.com/neuphonic/neucodec
- **Paper:** *Coming soon!*

## Get Started
 
 
```python
with torch.no_grad():
    ...
    print(f"Codes shape: {fsq_codes.shape}")
    recon = model.decode_code(fsq_codes).cpu()  # (B, 1, T_24)

sf.write("reconstructed.wav", recon[0, 0, :].numpy(), 24_000)
```
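Only the tail of the Get Started snippet survives in this excerpt, so here is a minimal end-to-end sketch of the same round trip. The `neucodec` package name, the `NeuCodec.from_pretrained` loader, and the `encode_code` method are assumptions based on the repository linked above, not on anything shown in this card.

```python
import torch
import torchaudio
import soundfile as sf
from neucodec import NeuCodec  # assumed import path; see the repository above

model = NeuCodec.from_pretrained("neuphonic/neucodec").eval()  # assumed loader

wav, sr = torchaudio.load("speech.wav")                     # (channels, T)
if sr != 16_000:
    wav = torchaudio.functional.resample(wav, sr, 16_000)   # encoder expects 16 kHz

with torch.no_grad():
    fsq_codes = model.encode_code(wav)            # assumed encoder entry point
    print(f"Codes shape: {fsq_codes.shape}")
    recon = model.decode_code(fsq_codes).cpu()    # (B, 1, T_24)

sf.write("reconstructed.wav", recon[0, 0, :].numpy(), 24_000)  # decoder output is 24 kHz
```

Note the asymmetry: input is resampled to 16 kHz for the encoder, while the reconstruction is written at 24 kHz to match the upsampling decoder.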
## Training Details

The model was trained on the following data:

* Emilia-YODAS
* MLS
* LibriTTS
* Fleurs
* CommonVoice
* HUI
* An additional proprietary set

All publicly available data is covered by either the CC-BY-4.0 or CC0 license.
### Benchmarks

| Codec | Quantizer Token Rate (Hz) | Tokens Per Second | Bitrate | Codebook Size | Quantizers | Params (M) | Autoencoding RTF | Decoding RTF | WER (%) | CER (%) |
| -------- | ------- | -------- | ------- | -------- | ------- | -------- | ------- | -------- | ------- | -------- |
| DAC | 75 | 600 | 6 kbps | 1024 | 8 | 74.7 | 0.015 | 0.007 | 1.9 | 0.06 |
| Mimi | 12.5 | 150 | 1.1 kbps | 2k | 8 | 79.3 | 0.012 | 0.006 | 3.0 | 1.4 |
| NeuCodec | 50 | 50 | 0.8 kbps | 65k | 1 | 800 | 0.030 | 0.003 | 2.5 | 1.0 |
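Two reading notes, assuming the usual conventions: the NeuCodec row's bitrate follows from 50 tokens/s at 16 bits per token (one 65k-entry codebook), i.e. 800 bps = 0.8 kbps; and RTF (real-time factor) is compute time divided by audio duration, so values well below 1 mean much faster than real time.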