 ---
+language:
+- en
 license: apache-2.0
+tags:
+- audio
+- speech
+- audio-to-audio
+- speechlm
+datasets:
+- amphion/Emilia-Dataset
+- facebook/multilingual_librispeech
+- openslr/librispeech_asr
+- CSTR-Edinburgh/vctk
+- google/fleurs
+- mozilla-foundation/common_voice_13_0
+metrics:
+- wer
 ---
+# Model Card for NeuCodec
+<!-- Provide a quick summary of what the model is/does. -->
+NeuCodec is an FSQ-based audio codec for speech tokenization.
+## Model Details
+<!-- Provide a longer summary of what this model is. -->
+NeuCodec is an ultra-low-bitrate audio codec that builds on the following advances:
+* It uses both audio (BigCodec encoder) and semantic (Wav2Vec2-BERT-large) information in the encoding process.
+* The quantisation method is FSQ rather than RVQ, resulting in a single vector for the quantised output, which makes it ideal for downstream modeling in SpeechLMs.
+* At 50 tokens/sec and 16 bits per token, the bit-rate is 800 bits/sec.
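
The bitrate figure follows directly from the token rate and the codebook size. The snippet below is a quick check of that arithmetic, assuming the "65k" codebook listed in the results table means 2^16 = 65,536 entries; the variable names are illustrative only.

```python
import math

TOKENS_PER_SECOND = 50     # token rate stated in this model card
CODEBOOK_SIZE = 2 ** 16    # assumption: the "65k" codebook has 65,536 entries

# A token drawn from a codebook of N entries carries log2(N) bits.
bits_per_token = math.log2(CODEBOOK_SIZE)
bitrate_bps = TOKENS_PER_SECOND * bits_per_token

print(f"{bits_per_token:.0f} bits/token, {bitrate_bps:.0f} bits/sec")
```
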
+Our work is largely based on [HKUSTAudio/xcodec2](https://huggingface.co/HKUSTAudio/xcodec2).
+- **Developed by:** Neuphonic
+- **Model type:** Neural Audio Codec
+- **Language(s):** English
+- **License:** apache-2.0
+### Model Sources
+<!-- Provide the basic links for the model. -->
+- **Repository:** https://github.com/neuphonic/neucodec
+- **Paper:** *Coming soon*
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+NeuCodec can be used directly to compress audio for fast, low-bitrate transmission.
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
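
To make the transmission use case concrete, the sketch below packs a sequence of 16-bit token IDs into bytes and back using only the Python standard library. `pack_codes` and `unpack_codes` are hypothetical helpers, not part of the `neucodec` package, and the token IDs are dummy values rather than real model output.

```python
import struct

def pack_codes(codes):
    """Pack token IDs (each in [0, 65535]) into little-endian 16-bit words."""
    return struct.pack(f"<{len(codes)}H", *codes)

def unpack_codes(payload):
    """Inverse of pack_codes: recover the list of token IDs from bytes."""
    return list(struct.unpack(f"<{len(payload) // 2}H", payload))

# 50 dummy token IDs, i.e. one second of audio at 50 tokens/sec.
one_second = [0, 1, 65535] + [1234] * 47
payload = pack_codes(one_second)

print(len(payload) * 8, "bits for one second of audio")
```

Because each token fits in exactly two bytes, the payload size matches the 800 bits/sec figure with no extra framing; a real transport would add its own headers on top.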
+### Downstream Use
+Because NeuCodec compresses input audio into a single-vector, tokenizable encoding, its tokens are intended to be used as the input to, or the training target of, a SpeechLM for tasks such as speech synthesis or speech recognition.
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
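
One common way to expose codec tokens to a language model is to append them to the text vocabulary at a fixed offset. The sketch below illustrates that idea only; it is not necessarily how any particular SpeechLM consumes NeuCodec tokens. `TEXT_VOCAB_SIZE` and both helper functions are hypothetical.

```python
TEXT_VOCAB_SIZE = 32_000    # hypothetical size of the text vocabulary
CODEBOOK_SIZE = 2 ** 16     # assumption: 16 bits per codec token

def codec_to_lm_ids(codes, offset=TEXT_VOCAB_SIZE):
    """Shift codec token IDs so they sit after the text vocabulary."""
    return [offset + c for c in codes]

def lm_to_codec_ids(ids, offset=TEXT_VOCAB_SIZE):
    """Map LM token IDs back to codec IDs, rejecting non-speech tokens."""
    codes = []
    for i in ids:
        c = i - offset
        if not 0 <= c < CODEBOOK_SIZE:
            raise ValueError(f"token {i} is not a speech token")
        codes.append(c)
    return codes

lm_ids = codec_to_lm_ids([0, 42, 65535])
print(lm_ids)
```

With a single token stream per frame, the language model needs only one extra block of vocabulary entries, rather than one block per residual quantizer.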
+## How to Get Started with the Model
+Use the code below to get started with the model.
+To install from PyPI in a dedicated environment:
+```bash
+# Quote the version spec so the shell does not treat ">" as a redirect.
+conda create -n neucodec "python>3.9"
+conda activate neucodec
+pip install neucodec
+```
+Then, to use in Python:
+```python
+import torch
+import soundfile as sf
+from neucodec import NeuCodec
+
+model_path = "Neuphonic/neucodec"
+model = NeuCodec.from_pretrained(model_path)
+model.eval().cuda()  # inference mode; requires a CUDA-capable GPU
+
+# Load the input audio (assumed mono here, so `wav` has shape (T,)).
+wav, sr = sf.read("test.wav")
+wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)
+
+with torch.no_grad():
+    # Encode the waveform into discrete codes, then decode them back to audio.
+    vq_code = model.encode_code(input_waveform=wav_tensor)
+    print("Codes: ", vq_code)
+    recon_wav = model.decode_code(vq_code).cpu()       # Shape: (1, 1, T')
+
+# Save the reconstruction, reusing the input sample rate.
+sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
+```
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+The model was trained on a mix of publicly available and proprietary data. The publicly available data includes the English segments of Emilia-YODAS, MLS, LibriTTS, Fleurs, CommonVoice, and HUI.
+### Training Procedure
+The model was trained for 800k steps on one 8xH100 node with an effective batch size of 64.
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+CMU-Arctic
+<!-- This should link to a Dataset Card if possible. -->
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+Because we are interested in the degree of distortion between the original and reconstructed audio, our evaluation metrics include PESQ, STOI, SI-SDR, Mel-Spectrogram MSE, and diff WER.
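
As an illustration of the last metric, the sketch below computes word error rate with a word-level edit distance. It assumes that "diff WER" means the WER of a transcript of the reconstructed audio minus the WER of a transcript of the original audio; that reading is an assumption, and the transcripts are made-up examples rather than evaluation data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

reference = "the cat sat on the mat"
asr_on_original = "the cat sat on the mat"       # made-up ASR output
asr_on_reconstruction = "the cat sat on a mat"   # made-up ASR output

# Assumed definition of diff WER: degradation introduced by the codec.
diff_wer = wer(reference, asr_on_reconstruction) - wer(reference, asr_on_original)
print(f"diff WER: {diff_wer:.3f}")
```
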
+### Results
+| Codec | Quantizer Token Rate (Hz) | Tokens Per Second | Bitrate | Codebook Size | Quantizers | Params (M) | Autoencoding RTF | Decoding RTF | WER (%) | CER (%) |
+| -------- | ------- | -------- | ------- | -------- | ------- | -------- | ------- | -------- | ------- | -------- |
+| DAC | 75 | 600 | 6 kbps | 1024 | 8 | 74.7 | 0.015 | 0.007 | 1.9 | 0.06 |
+| Mimi | 12.5 | 100 | 1.1 kbps | 2k | 8 | 79.3 | 0.012 | 0.006 | 3.0 | 1.4 |
+| NeuCodec | 50 | 50 | 0.8 kbps | 65k | 1 | 800 | 0.030 | 0.003 | 2.5 | 1.0 |
+## Citation
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+Coming Soon