Update README.md
README.md CHANGED
@@ -7,7 +7,7 @@ pipeline_tag: feature-extraction
---


-# NVIDIA
<style>
img{
display: inline-table;
@@ -21,7 +21,7 @@ padding: 0;
| [](#datasets)


-The [
Model variant details:

| Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
@@ -32,7 +32,7 @@ This model is ready for commercial/non-commercial use.



-##

Model | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
:-----------:|:-----------:|:----------:|:----------:|:-----------:|:-------------:|:-----------:|:------------:|
@@ -54,13 +54,13 @@ Model | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
<br>Huggingface [08/11/2025] via https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps<br>

## Model Architecture
-

The non-causal encoder consists of five residual blocks, each block containing three residual layers similar to the [multi-receptive field fusion (MRF) module](https://arxiv.org/abs/2010.05646). The causal decoder, based on the HiFi-GAN vocoder, uses upsampling rates that are the reverse of the encoder's one-dimensional (1D) convolutional strides.

For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with thirteen codebooks, four dimensions per code, and 2016 codes per codebook. For the discriminators, we utilize three neural networks, all employing a squared-GAN and feature-matching loss. We adopt the [multi-period discriminator](https://arxiv.org/abs/2010.05646), [multi-band multi-scale STFT discriminator](https://arxiv.org/abs/2306.06546), and [WavLM-based discriminator](https://arxiv.org/abs/2409.12117).

-For more details please check [our paper]().

**This model was developed based on [NVIDIA Low Frame-rate Speech Codec](https://huggingface.co/nvidia/low-frame-rate-speech-codec-22khz)**

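The FSQ scheme described in the architecture paragraph (thirteen codebooks, four dimensions per code, 2016 codes per codebook) can be sketched for a single codebook. The per-dimension level split `(8, 7, 6, 6)` below is an illustrative assumption whose product is 2016, not a value taken from this model card; the bounding-and-rounding form follows the FSQ paper.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization sketch: bound each latent dimension with tanh,
    then round it to one of levels[i] integer values."""
    L = np.asarray(levels, dtype=float)
    half_l = (L - 1.0) / 2.0
    offset = np.where(L % 2 == 0, 0.5, 0.0)    # even level counts need a half-step shift
    shift = np.arctanh(offset / half_l)
    bounded = np.tanh(z + shift) * half_l - offset
    return np.round(bounded)                   # one integer code per dimension

def fsq_index(codes, levels):
    """Pack the per-dimension codes into a single codebook index (mixed radix)."""
    L = np.asarray(levels, dtype=int)
    digits = (codes + L // 2).astype(int)      # shift each digit into [0, L[i])
    strides = np.cumprod(np.concatenate(([1], L[:-1])))
    return int(np.dot(digits, strides))

levels = (8, 7, 6, 6)                 # assumed per-dimension split; 8*7*6*6 = 2016
z = np.array([0.3, -1.2, 0.0, 2.5])   # one 4-dimensional latent vector
codes = fsq_quantize(z, levels)
idx = fsq_index(codes, levels)        # integer in [0, 2016)
```

With thirteen such codebooks, each frame is represented by thirteen integer indices of this kind.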
@@ -183,11 +183,11 @@ For fine-tuning on another dataset, please follow the steps available at our [Au

## Training, Testing, and Evaluation Datasets:

-The


### Training Datasets
-The

Link: [MLS English](https://www.openslr.org/94/) [25.5k]

@@ -242,7 +242,7 @@ Link: [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_v

## Performance

-We evaluated our codec using multiple objective audio quality metrics across two distinct test sets. Additionally, we compared our model's performance with state-of-the-art codecs. For further details, please refer to [our paper]().


Variant results:
@@ -7,7 +7,7 @@ pipeline_tag: feature-extraction
---


+# NVIDIA NeMo NanoCodec
<style>
img{
display: inline-table;
@@ -21,7 +21,7 @@ padding: 0;
| [](#datasets)


+The [NeMo NanoCodec](https://arxiv.org/abs/2508.05835v1) is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve state-of-the-art audio compression across different bitrate and frame rate ranges.
Model variant details:

| Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
@@ -32,7 +32,7 @@ This model is ready for commercial/non-commercial use.



+## NeMo NanoCodec variants

Model | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
:-----------:|:-----------:|:----------:|:----------:|:-----------:|:-------------:|:-----------:|:------------:|
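The bitrate column of the variants table follows from the other columns: each frame stores one index per codebook, and an index over a codebook of size S costs log2(S) bits. A minimal sketch; the 4-codebook, 4096-code configuration below is illustrative arithmetic, not a row claimed from the table.

```python
import math

def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    # each frame stores one index per codebook; an index costs log2(size) bits
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame

# illustrative only: 4 codebooks of 4096 codes at 12.5 frames/s -> 600 bps (0.6 kbps)
rate = codec_bitrate_bps(12.5, 4, 4096)
```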
@@ -54,13 +54,13 @@ Model | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
<br>Huggingface [08/11/2025] via https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps<br>

## Model Architecture
+NeMo NanoCodec is composed of a fully convolutional generator neural network and three discriminators. The generator comprises an encoder, followed by vector quantization, and a [HiFi-GAN-based](https://arxiv.org/abs/2010.05646) decoder.

The non-causal encoder consists of five residual blocks, each block containing three residual layers similar to the [multi-receptive field fusion (MRF) module](https://arxiv.org/abs/2010.05646). The causal decoder, based on the HiFi-GAN vocoder, uses upsampling rates that are the reverse of the encoder's one-dimensional (1D) convolutional strides.

For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with thirteen codebooks, four dimensions per code, and 2016 codes per codebook. For the discriminators, we utilize three neural networks, all employing a squared-GAN and feature-matching loss. We adopt the [multi-period discriminator](https://arxiv.org/abs/2010.05646), [multi-band multi-scale STFT discriminator](https://arxiv.org/abs/2306.06546), and [WavLM-based discriminator](https://arxiv.org/abs/2409.12117).

+For more details, please check [our paper](https://arxiv.org/abs/2508.05835v1).

**This model was developed based on [NVIDIA Low Frame-rate Speech Codec](https://huggingface.co/nvidia/low-frame-rate-speech-codec-22khz)**

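The squared-GAN and feature-matching objectives mentioned for the three discriminators have a compact general form. A minimal numpy sketch of the standard least-squares GAN and feature-matching losses; reductions and weights here are generic assumptions, not the paper's exact training settings.

```python
import numpy as np

def lsgan_d_loss(real_logits, fake_logits):
    # squared (least-squares) GAN: discriminator pushes real -> 1, fake -> 0
    return np.mean((real_logits - 1.0) ** 2) + np.mean(fake_logits ** 2)

def lsgan_g_loss(fake_logits):
    # generator pushes discriminator outputs on generated audio toward 1
    return np.mean((fake_logits - 1.0) ** 2)

def feature_matching_loss(real_feats, fake_feats):
    # L1 distance between discriminator feature maps of real vs. generated audio
    return sum(np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats))
```

In training, each of the three discriminators would contribute one such adversarial term plus a feature-matching term to the generator objective.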
@@ -183,11 +183,11 @@ For fine-tuning on another dataset, please follow the steps available at our [Au

## Training, Testing, and Evaluation Datasets:

+The NeMo NanoCodec was trained on 28.7k hours of speech data spanning 105 languages. The model was evaluated on multilingual audiobook-style data and high-quality English recordings. For further details, refer to [our paper](https://arxiv.org/abs/2508.05835v1).


### Training Datasets
+The NeMo NanoCodec is trained on a total of 28.7k hours of speech data from 105 languages.

Link: [MLS English](https://www.openslr.org/94/) [25.5k]

@@ -242,7 +242,7 @@ Link: [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_v

## Performance

+We evaluated our codec using multiple objective audio quality metrics across two distinct test sets. Additionally, we compared our model's performance with state-of-the-art codecs. For further details, please refer to [our paper](https://arxiv.org/abs/2508.05835v1).


Variant results:
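The performance section does not name the objective metrics in this excerpt. As one example of a reconstruction metric commonly used to evaluate codecs, scale-invariant SDR can be computed as below; this is an assumption for illustration, not necessarily among the paper's chosen metrics.

```python
import numpy as np

def si_sdr_db(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref   # best-scaled reference
    noise = est - proj
    return 10.0 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))

# toy check on a sine tone with a small amount of added noise
t = np.linspace(0.0, 1.0, 22050)
clean = np.sin(2 * np.pi * 220.0 * t)
noisy = clean + 0.01 * np.random.default_rng(0).standard_normal(t.size)
score = si_sdr_db(clean, noisy)
```

Because the metric is scale-invariant, rescaling the estimate leaves the score unchanged, which makes it robust to the gain differences codecs can introduce.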