---
license: apache-2.0
tags:
- audio
- speech
- audio-to-audio
- speech-language-models
datasets:
- amphion/Emilia-Dataset
- facebook/multilingual_librispeech
- CSTR-Edinburgh/vctk
- google/fleurs
- mozilla-foundation/common_voice_13_0
- mythicinfinity/libritts_r
---
# NeuCodec 🎧
[Watch NeuCodec in action on YouTube!](https://www.youtube.com/watch?v=O7XH1lGZyYY)
*Created by Neuphonic - building faster, smaller, on-device voice AI*
A lightweight neural codec that encodes audio at just 0.8 kbps - perfect for researchers and builders who need something that *just works* for training high quality text-to-speech models.
# Key Features
* Low bit-rate compression - a speech codec that compresses and reconstructs audio with near-inaudible reconstruction loss
<br>
* Upsamples from 16kHz → 24kHz
<br>
* Ready for real-world use - train your own SpeechLMs without needing to build your own codec
<br>
* Commercial use permitted - use it in your own tools or products
<br>
* Released with large pre-encoded datasets - we've compressed Emilia-YODAS from 1.7TB to 41GB using NeuCodec, significantly reducing the compute requirements needed for training
<br>
# Model Details
NeuCodec is a Finite Scalar Quantisation (FSQ) based 0.8kbps audio codec for speech tokenization.
It takes advantage of the following features:
* FSQ results in a single codebook, making it ideal for downstream modeling with Speech Language Models.
* Trained on Creative Commons (CC) licensed data, so no non-commercial data restrictions apply.
* At 50 tokens/sec and 16 bits per token, the overall bit-rate is 0.8kbps.
* The codec takes in 16kHz input and outputs 24kHz using an upsampling decoder.
* The FSQ encoding scheme allows for bit-level error resistance suitable for unreliable and noisy channels.
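The bit-rate figure above follows directly from the token rate and token width. As a quick sanity check, the arithmetic can be sketched in plain Python. The comparison against uncompressed audio assumes 16-bit mono PCM at 16kHz, which is an assumption for illustration and not a figure from this card.

```python
# Figures stated in this model card.
TOKENS_PER_SECOND = 50
BITS_PER_TOKEN = 16

bitrate_bps = TOKENS_PER_SECOND * BITS_PER_TOKEN  # bits per second
bitrate_kbps = bitrate_bps / 1000
codebook_size = 2 ** BITS_PER_TOKEN               # distinct values a 16-bit token can take

# Storage for the codes of one hour of speech (tokens only, no container overhead).
bytes_per_hour = bitrate_bps // 8 * 3600

# Assumption for comparison only: uncompressed 16 kHz, 16-bit, mono PCM input.
pcm_bps = 16_000 * 16
ratio_vs_pcm = pcm_bps / bitrate_bps

print(f"{bitrate_kbps} kbps, {codebook_size} code values")
print(f"{bytes_per_hour} bytes per hour, {ratio_vs_pcm:.0f}x smaller than 16-bit PCM")
```

Under these assumptions, one hour of speech is 360,000 bytes of codes.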
NeuCodec is largely based on extending the work of [X-Codec2.0](https://huggingface.co/HKUSTAudio/xcodec2).
- **Developed by:** Neuphonic
- **Model type:** Neural Audio Codec
- **License:** apache-2.0
- **Repository:** https://github.com/neuphonic/neucodec
- **Paper:** *Coming soon!*
- **Pre-encoded Datasets:**
  - [Emilia-YODAS-EN](https://huggingface.co/datasets/neuphonic/emilia-yodas-english-neucodec)
  - *More coming soon!*
# Get Started
Use the code below to get started with the model.
To install from PyPI in a dedicated environment, using Python 3.10 or above:
```bash
conda create -n neucodec python=3.10
conda activate neucodec
pip install neucodec
```
Then, to use in Python:
```python
import librosa
import torch
import torchaudio
from torchaudio import transforms as T
from neucodec import NeuCodec

# Load the pretrained codec and move it to the GPU for inference
model = NeuCodec.from_pretrained("neuphonic/neucodec")
model.eval().cuda()

# Load an example clip and resample to the 16 kHz input rate if needed
y, sr = torchaudio.load(librosa.ex("libri1"))
if sr != 16_000:
    y = T.Resample(sr, 16_000)(y)
y = y[None, ...]  # add the batch dimension in every case: (B, 1, T_16)

with torch.no_grad():
    fsq_codes = model.encode_code(y)
    # fsq_codes = model.encode_code(librosa.ex("libri1"))  # or directly pass your filepath!
    print(f"Codes shape: {fsq_codes.shape}")
    recon = model.decode_code(fsq_codes).cpu()  # (B, 1, T_24)

# Save the first item of the batch at the 24 kHz output rate
torchaudio.save("reconstructed.wav", recon[0, :, :], 24_000)
```
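When checking the printed shapes, it can help to know roughly how many codes and output samples to expect for a given clip. The sketch below derives both from the rates stated in this card (16kHz in, 50 tokens/sec, 24kHz out). The helper `expected_lengths` is hypothetical and not part of the `neucodec` package, and the floor rounding is an assumption: the model's actual padding at clip edges may shift the counts slightly.

```python
INPUT_SR = 16_000        # input sample rate stated in the card
OUTPUT_SR = 24_000       # output sample rate stated in the card
TOKENS_PER_SECOND = 50   # token rate stated in the card


def expected_lengths(num_input_samples: int) -> tuple[int, int]:
    """Return (approximate code count, approximate 24 kHz output sample count)."""
    in_per_token = INPUT_SR // TOKENS_PER_SECOND    # 320 input samples per code
    out_per_token = OUTPUT_SR // TOKENS_PER_SECOND  # 480 output samples per code
    num_codes = num_input_samples // in_per_token   # floor rounding is an assumption
    return num_codes, num_codes * out_per_token


# A 5-second clip at 16 kHz
print(expected_lengths(5 * INPUT_SR))
```

If the shapes you observe differ by a frame or so, that is most likely edge padding rather than an error.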
# Training Details
The model was trained using the following data:
* Emilia-YODAS
* MLS
* LibriTTS
* Fleurs
* CommonVoice
* HUI
* Additional proprietary set
All publicly available data was covered by either the CC-BY-4.0 or CC0 license.