CasanovaE committed (verified) · Commit 0999213 · Parent(s): 978ef76

Update README.md

Files changed (1): README.md (+8 −8)
README.md CHANGED
@@ -7,7 +7,7 @@ pipeline_tag: feature-extraction
 ---
 
 
-# NVIDIA nemo-nano-codec
+# NVIDIA NeMo NanoCodec
 <style>
 img{
 display: inline-table;
@@ -21,7 +21,7 @@ padding: 0;
 | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)
 
 
-The [nemo-nano-codec]() is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve state-of-the-art audio compression across different bitrate and frame rate ranges.
+The [NeMo NanoCodec](https://arxiv.org/abs/2508.05835v1) is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve state-of-the-art audio compression across different bitrate and frame rate ranges.
 Model variant details:
 
 | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
@@ -32,7 +32,7 @@ This model is ready for commercial/non-commercial use.
 
 
 
-## nemo-nano-codec variants
+## NeMo NanoCodec variants
 
 Model | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
 :-----------:|:-----------:|:----------:|:----------:|:-----------:|:-------------:|:-----------:|:------------:|
@@ -54,13 +54,13 @@ Model | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Em
 <br>Huggingface [08/11/2025] via https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps<br>
 
 ## Model Architecture
-nemo-nano-codec is composed of a fully convolutional generator neural network and three discriminators. The generator comprises an encoder, followed by vector quantization, and a [HiFi-GAN-based](https://arxiv.org/abs/2010.05646) decoder.
+NeMo NanoCodec is composed of a fully convolutional generator neural network and three discriminators. The generator comprises an encoder, followed by vector quantization, and a [HiFi-GAN-based](https://arxiv.org/abs/2010.05646) decoder.
 
 The non-causal encoder consists of five residual blocks, each block containing three residual layers similar to the [multi-receptive field fusion (MRF) module](https://arxiv.org/abs/2010.05646). The causal decoder, based on the HiFi-GAN vocoder, uses upsampling rates that are the reverse of the encoder's One-Dimensional (1D) convolutional strides.
 
 For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with thirteen codebooks, four dimensions per code, and 2016 codes per codebook. For the discriminators, we utilize three neural networks, all employing a squared-GAN and feature-matching loss. We adopt the [multi-period discriminator](https://arxiv.org/abs/2010.05646), [multi-band multi-scale STFT discriminator](https://arxiv.org/abs/2306.06546), and [WavLM-based discriminator](https://arxiv.org/abs/2409.12117).
 
-For more details please check [our paper]().
+For more details, please check [our paper](https://arxiv.org/abs/2508.05835v1).
 
 **This model was developed based on [NVIDIA Low Frame-rate Speech Codec](https://huggingface.co/nvidia/low-frame-rate-speech-codec-22khz)**
 
@@ -183,11 +183,11 @@ For fine-tuning on another dataset, please follow the steps available at our [Au
 
 ## Training, Testing, and Evaluation Datasets:
 
-The nemo-nano-codec was trained on 28.7k hours of speech data spanning 105 languages. The model was evaluated using multilingual audiobook-style data and high-quality English recordings. For further details, refer to [our paper]().
+The NeMo NanoCodec was trained on 28.7k hours of speech data spanning 105 languages. The model was evaluated using multilingual audiobook-style data and high-quality English recordings. For further details, refer to [our paper](https://arxiv.org/abs/2508.05835v1).
 
 
 ### Training Datasets
-The nemo-nano-codec is trained on a total of 28.7k hrs of speech data from 105 languages.
+The NeMo NanoCodec is trained on a total of 28.7k hrs of speech data from 105 languages.
 
 Link: [MLS English](https://www.openslr.org/94/) [25.5k]
 
@@ -242,7 +242,7 @@ Link: [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_v
 
 ## Performance
 
-We evaluated our codec using multiple objective audio quality metrics across two distinct test sets. Additionally, we compared our model's performance with state-of-the-art codecs. For further details, please refer to [our paper]().
+We evaluated our codec using multiple objective audio quality metrics across two distinct test sets. Additionally, we compared our model's performance with state-of-the-art codecs. For further details, please refer to [our paper](https://arxiv.org/abs/2508.05835v1).
 
 
 Variant results:
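The Finite Scalar Quantization step described in the Model Architecture section can be sketched in a few lines of NumPy. This is an illustrative sketch, not NeMo's implementation; the per-dimension levels (8, 7, 6, 6) are an assumption chosen because 8 × 7 × 6 × 6 = 2016 matches the 2016 codes per codebook stated in the diff.

```python
import numpy as np

# Assumed FSQ levels for a 4-dimensional code; their product (2016)
# matches the stated codebook size. The real model's levels may differ.
LEVELS = np.array([8, 7, 6, 6])

def fsq_quantize(z):
    """Quantize a latent vector z (shape [4]) to integer levels per dimension.

    FSQ bounds each dimension to a fixed range and rounds to the nearest
    of LEVELS[d] uniformly spaced values -- no learned codebook lookup.
    """
    bounded = (np.tanh(z) + 1.0) / 2.0            # each dim squashed into (0, 1)
    return np.round(bounded * (LEVELS - 1)).astype(int)  # dim d in {0..LEVELS[d]-1}

def code_index(q):
    """Flatten the per-dimension levels into one codebook index in [0, 2016)."""
    idx = 0
    for level, val in zip(LEVELS, q):
        idx = idx * level + int(val)              # mixed-radix positional encoding
    return idx

z = np.array([0.3, -1.2, 2.0, 0.0])
q = fsq_quantize(z)
print("per-dim levels:", q, "-> codebook index:", code_index(q))
```

Because the quantizer is just bound-and-round, the straight-through gradient trick used in FSQ training needs no codebook-commitment losses, which is part of why the codec can scale to thirteen codebooks cheaply.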