CasanovaE committed (verified) · Commit 0999213 · Parent(s): 978ef76

Update README.md

Files changed (1): README.md (+8 −8)
README.md CHANGED
@@ -7,7 +7,7 @@ pipeline_tag: feature-extraction
 ---
 
 
-# NVIDIA nemo-nano-codec
+# NVIDIA NeMo NanoCodec
 <style>
 img{
 display: inline-table;
@@ -21,7 +21,7 @@ padding: 0;
 | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)
 
 
-The [nemo-nano-codec]() is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve state-of-the-art audio compression across different bitrate and frame rate ranges.
+The [NeMo NanoCodec](https://arxiv.org/abs/2508.05835v1) is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve state-of-the-art audio compression across different bitrate and frame rate ranges.
 Model variant details:
 
 | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
@@ -32,7 +32,7 @@ This model is ready for commercial/non-commercial use.
 
 
 
-## nemo-nano-codec variants
+## NeMo NanoCodec variants
 
 Model | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
 :-----------:|:-----------:|:----------:|:----------:|:-----------:|:-------------:|:-----------:|:------------:|
@@ -54,13 +54,13 @@ Model | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Em
 <br>Huggingface [08/11/2025] via https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps<br>
 
 ## Model Architecture
-nemo-nano-codec is composed of a fully convolutional generator neural network and three discriminators. The generator comprises an encoder, followed by vector quantization, and a [HiFi-GAN-based](https://arxiv.org/abs/2010.05646) decoder.
+NeMo NanoCodec is composed of a fully convolutional generator neural network and three discriminators. The generator comprises an encoder, followed by vector quantization, and a [HiFi-GAN-based](https://arxiv.org/abs/2010.05646) decoder.
 
 The non-causal encoder consists of five residual blocks, each block containing three residual layers similar to the [multi-receptive field fusion (MRF) module](https://arxiv.org/abs/2010.05646). The causal decoder, based on the HiFi-GAN vocoder, uses upsampling rates that are the reverse of the encoder's One-Dimensional (1D) convolutional strides.
 
 For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with thirteen codebooks, four dimensions per code, and 2016 codes per codebook. For the discriminators, we utilize three neural networks, all employing a squared-GAN and feature-matching loss. We adopt the [multi-period discriminator](https://arxiv.org/abs/2010.05646), [multi-band multi-scale STFT discriminator](https://arxiv.org/abs/2306.06546), and [WavLM-based discriminator](https://arxiv.org/abs/2409.12117).
 
-For more details please check [our paper]().
+For more details, please check [our paper](https://arxiv.org/abs/2508.05835v1).
 
 **This model was developed based on [NVIDIA Low Frame-rate Speech Codec](https://huggingface.co/nvidia/low-frame-rate-speech-codec-22khz)**
 
@@ -183,11 +183,11 @@ For fine-tuning on another dataset, please follow the steps available at our [Au
 
 ## Training, Testing, and Evaluation Datasets:
 
-The nemo-nano-codec was trained on 28.7k hours of speech data spanning 105 languages. The model was evaluated using multilingual audiobook-style data and high-quality English recordings. For further details, refer to [our paper]().
+The NeMo NanoCodec was trained on 28.7k hours of speech data spanning 105 languages. The model was evaluated using multilingual audiobook-style data and high-quality English recordings. For further details, refer to [our paper](https://arxiv.org/abs/2508.05835v1).
 
 
 ### Training Datasets
-The nemo-nano-codec is trained on a total of 28.7k hrs of speech data from 105 languages.
+The NeMo NanoCodec is trained on a total of 28.7k hrs of speech data from 105 languages.
 
 Link: [MLS English](https://www.openslr.org/94/) [25.5k]
 
@@ -242,7 +242,7 @@ Link: [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_v
 
 ## Performance
 
-We evaluated our codec using multiple objective audio quality metrics across two distinct test sets. Additionally, we compared our model's performance with state-of-the-art codecs. For further details, please refer to [our paper]().
+We evaluated our codec using multiple objective audio quality metrics across two distinct test sets. Additionally, we compared our model's performance with state-of-the-art codecs. For further details, please refer to [our paper](https://arxiv.org/abs/2508.05835v1).
 
 
 Variant results:
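The Finite Scalar Quantization step described in the Model Architecture section can be sketched in a few lines of NumPy. This is an illustrative sketch, not NeMo's implementation; the per-dimension levels (8, 7, 6, 6) are an assumption chosen because 8 × 7 × 6 × 6 = 2016 matches the 2016 codes per codebook stated in the diff.

```python
import numpy as np

# Assumed FSQ levels for a 4-dimensional code; their product (2016)
# matches the stated codebook size. The real model's levels may differ.
LEVELS = np.array([8, 7, 6, 6])

def fsq_quantize(z):
    """Quantize a latent vector z (shape [4]) to integer levels per dimension.

    FSQ bounds each dimension to a fixed range and rounds to the nearest
    of LEVELS[d] uniformly spaced values -- no learned codebook lookup.
    """
    bounded = (np.tanh(z) + 1.0) / 2.0            # each dim squashed into (0, 1)
    return np.round(bounded * (LEVELS - 1)).astype(int)  # dim d in {0..LEVELS[d]-1}

def code_index(q):
    """Flatten the per-dimension levels into one codebook index in [0, 2016)."""
    idx = 0
    for level, val in zip(LEVELS, q):
        idx = idx * level + int(val)              # mixed-radix positional encoding
    return idx

z = np.array([0.3, -1.2, 2.0, 0.0])
q = fsq_quantize(z)
print("per-dim levels:", q, "-> codebook index:", code_index(q))
```

Because the quantizer is just bound-and-round, the straight-through gradient trick used in FSQ training needs no codebook-commitment losses, which is part of why the codec can scale to thirteen codebooks cheaply.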