harryjulian committed on
Commit 622e1b9 · verified · 1 Parent(s): 8272707

Update README.md

Files changed (1): README.md (+147, -1)
README.md CHANGED
@@ -1,5 +1,151 @@
---
license: apache-2.0
---

- This is a filler model card.
---
language:
- en
license: apache-2.0
tags:
- audio
- speech
- audio-to-audio
- speechlm
datasets:
- amphion/Emilia-Dataset
- facebook/multilingual_librispeech
- openslr/librispeech_asr
- CSTR-Edinburgh/vctk
- google/fleurs
- mozilla-foundation/common_voice_13_0
metrics:
- wer
---

# Model Card for NeuCodec

<!-- Provide a quick summary of what the model is/does. -->

NeuCodec is an FSQ-based audio codec for speech tokenization.

## Model Details

<!-- Provide a longer summary of what this model is. -->

NeuCodec is an ultra-low-bitrate audio codec that takes advantage of the following advances:

* It uses both acoustic (BigCodec encoder) and semantic (Wav2Vec2-BERT-large) information in the encoding process.
* The quantisation method is FSQ rather than RVQ, resulting in a single code vector for the quantised output, which makes it ideal for downstream modeling in SpeechLMs.
* At 50 tokens/sec and 16 bits per token, the bitrate is 800 bits/sec (see the quick check below).
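
A quick back-of-the-envelope check of those numbers (an illustrative sketch, not part of the `neucodec` package):

```python
import math

# NeuCodec emits 50 codes (tokens) per second of audio.
tokens_per_second = 50

# With a 65,536-entry FSQ codebook, each code carries log2(65536) = 16 bits,
# and (unlike RVQ) there is only a single code per frame.
codebook_size = 65_536
bits_per_token = math.log2(codebook_size)

bitrate = tokens_per_second * bits_per_token
print(f"{bitrate:.0f} bits/sec")  # -> 800 bits/sec

# For scale: a 16 kHz, 16-bit PCM stream is 256,000 bits/sec, i.e. ~320x larger.
print(f"reduction vs. 16 kHz 16-bit PCM: {16_000 * 16 / bitrate:.0f}x")
```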

Our work is largely based on [HKUSTAudio/xcodec2](https://huggingface.co/HKUSTAudio/xcodec2).

- **Developed by:** Neuphonic
- **Model type:** Neural Audio Codec
- **Language(s):** English
- **License:** apache-2.0

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/neuphonic/neucodec
- **Paper:** *Coming soon*

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

NeuCodec can be used directly to compress audio for fast, low-bitrate transmission.

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

### Downstream Use

Because NeuCodec compresses input audio into a single stream of discrete tokens, NeuCodec tokens are intended to be used as a training target or input to a SpeechLM for tasks such as speech synthesis or speech recognition (see the sketch below).

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
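
As a concrete, hypothetical illustration of that workflow, the sketch below flattens a code tensor into integer tokens for a language model; the shape, vocabulary size, and special-token ids are assumptions, not part of the `neucodec` API:

```python
import torch

# Stand-in for the output of model.encode_code(...) in the usage example below:
# assumed here to be an integer tensor of shape (1, 1, T), one code per frame.
vq_code = torch.randint(0, 65_536, (1, 1, 120))  # ~2.4 s of audio at 50 codes/sec

speech_tokens = vq_code.flatten().tolist()

# A SpeechLM might offset the codec ids past its text vocabulary and wrap them
# in begin/end-of-audio markers (all values here are placeholders).
TEXT_VOCAB_SIZE = 32_000
BOA_ID, EOA_ID = 0, 1
lm_tokens = [BOA_ID] + [t + TEXT_VOCAB_SIZE for t in speech_tokens] + [EOA_ID]
print(len(lm_tokens), lm_tokens[:5])
```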

## How to Get Started with the Model

Use the code below to get started with the model.

To install from PyPI in a dedicated environment:

```bash
conda create -n neucodec "python>=3.9"
conda activate neucodec
pip install neucodec
```

Then, to use in Python:

```python
import torch
import soundfile as sf
from neucodec import NeuCodec

model_path = "Neuphonic/neucodec"

model = NeuCodec.from_pretrained(model_path)
model.eval().cuda()

# Load a waveform and add a batch dimension.
wav, sr = sf.read("test.wav")
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

with torch.no_grad():
    # Encode to discrete FSQ codes, then decode back to audio.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Codes: ", vq_code)
    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
```
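
Because each code fits in 16 bits, the encoded sequence can be stored or transmitted very compactly. A minimal follow-up sketch (reusing `vq_code` from the example above, and assuming it is an integer tensor):

```python
import numpy as np

# A 65k-entry codebook fits in an unsigned 16-bit integer.
codes = vq_code.squeeze().cpu().numpy().astype(np.uint16)
np.save("test_codes.npy", codes)
print(f"{codes.size} codes -> {codes.nbytes} bytes of payload")
```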

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was trained on a mix of publicly available and proprietary data. The publicly available data includes the English segments of Emilia-YODAS, MLS, LibriTTS, Fleurs, CommonVoice, and HUI.

### Training Procedure

The model was trained for 800k steps on one 8xH100 node with an effective batch size of 64.

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

CMU-Arctic

<!-- This should link to a Dataset Card if possible. -->

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

As we are interested in the degree of distortion between the original and reconstructed audio, our evaluation metrics include PESQ, STOI, SI-SDR, Mel-Spectrogram MSE, and diff WER.
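
A minimal sketch of how such reconstruction metrics can be computed with `torchmetrics` and `torchaudio` (illustrative only; the sample rate, placeholder signals, and settings are assumptions, not our evaluation script):

```python
import torch
import torchaudio
from torchmetrics.audio import (
    PerceptualEvaluationSpeechQuality,   # requires the `pesq` package
    ShortTimeObjectiveIntelligibility,   # requires the `pystoi` package
    ScaleInvariantSignalDistortionRatio,
)

fs = 16_000  # assumed sample rate for this sketch
t = torch.arange(fs) / fs
ref = torch.sin(2 * torch.pi * 220.0 * t).unsqueeze(0)  # placeholder reference (1 s tone)
recon = ref + 0.01 * torch.randn_like(ref)              # placeholder "reconstruction"

pesq = PerceptualEvaluationSpeechQuality(fs, "wb")(recon, ref)
stoi = ShortTimeObjectiveIntelligibility(fs)(recon, ref)
si_sdr = ScaleInvariantSignalDistortionRatio()(recon, ref)

mel = torchaudio.transforms.MelSpectrogram(sample_rate=fs)
mel_mse = torch.nn.functional.mse_loss(mel(recon), mel(ref))

# diff WER: run the same ASR system on the original and reconstructed audio and
# report the difference in word error rate (ASR step not shown here).
print(f"PESQ={pesq:.2f}  STOI={stoi:.3f}  SI-SDR={si_sdr:.1f} dB  mel-MSE={mel_mse:.5f}")
```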

### Results

| Codec | Quantizer Token Rate (Hz) | Tokens Per Second | Bitrate | Codebook Size | Quantizers | Params (M) | Autoencoding RTF | Decoding RTF | WER (%) | CER (%) |
| -------- | ------- | -------- | ------- | -------- | ------- | -------- | ------- | -------- | ------- | -------- |
| DAC      | 75   | 600 | 6 kbps   | 1024 | 8 | 74.7 | 0.015 | 0.007 | 1.9 | 0.06 |
| Mimi     | 12.5 | 150 | 1.1 kbps | 2k   | 8 | 79.3 | 0.012 | 0.006 | 3.0 | 1.4  |
| NeuCodec | 50   | 50  | 0.8 kbps | 65k  | 1 | 800  | 0.030 | 0.003 | 2.5 | 1.0  |

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

Coming Soon