MagiCodec: Simple Masked Gaussian-Injected Audio Codec for High-Fidelity Reconstruction & Generation


MagiCodec is a single-layer, streaming codec model that delivers state-of-the-art audio quality and highly model-able discrete tokens.

This is the code for the MagiCodec neural audio codec presented in the paper MagiCodec: Simple Masked Gaussian-Injected Audio Codec for High-Fidelity Reconstruction and Generation [abs].

✨ Key Features

  • Single-layer streaming design – lightweight causal Transformer that drops straight into audio-native LLMs with minimal latency.
  • Low bit-rate & compute-efficient – <= 850 bps and <= 40 ms look-ahead keep bandwidth and FLOPs tiny for on-device use.
  • Rich acoustic and semantic information – captures fine-grained detail plus high-level semantics, achieving top scores in both waveform reconstruction and downstream tasks.
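
As a rough sanity check on the bit-rate figure, the arithmetic can be sketched as below. It assumes (this is not stated in this README) that the bit rate of a single-codebook codec is the token rate times the bits per token, and that the 850 bps bound applies to the 50 Hz checkpoint listed under Quick Start. The helper names are illustrative only.

```python
import math

def bits_per_token(bitrate_bps: float, token_hz: float) -> float:
    """Bits carried by each token when one token is emitted per frame."""
    return bitrate_bps / token_hz

def bitrate_bps(codebook_size: int, token_hz: float) -> float:
    """Bit rate of a single-codebook codec: token rate times log2(codebook size)."""
    return token_hz * math.log2(codebook_size)

bits = bits_per_token(850, 50)      # 850 bps spread over 50 tokens per second
implied_codebook = 2 ** int(bits)   # codebook size implied by that many bits

print(bits)              # 17.0
print(implied_codebook)  # 131072
```

Under these assumptions, 850 bps at 50 Hz corresponds to 17 bits per token. The actual codebook size should be confirmed from the paper or the checkpoint.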

🎧 Samples

To get a quick sense of our codec's performance, please check out the Sample Page. Comprehensive benchmarks covering a variety of baselines are available on this page.

🛠️ Installation

Our released code is based on a specific version (2.3.6) of FlashAttention.

Please note that different (including newer or older) versions of flash-attention may introduce changes to the C/Python interfaces called by our custom ops. Compatibility issues may arise if you use a different version of flash-attention.

We recommend creating a fresh conda environment for installation.

Env Setup

conda create -n MagiCodec python=3.9 -y
conda activate MagiCodec

Basic Requirements

Change the CUDA tag in the index URL (cu121 below) to match your own CUDA version.

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install packaging ninja cmake pybind11 pytorch_lightning transformers
conda install -c conda-forge 'ffmpeg<7'  
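
To adapt the `--index-url` above to another CUDA version, the URL suffix can be derived as in this small sketch. The helper `pytorch_index_url` is hypothetical, and not every CUDA version has wheels for a given PyTorch release, so check that the resulting index actually provides torch 2.5.1 before relying on it.

```python
def pytorch_index_url(cuda_version: str) -> str:
    """Map a CUDA version such as '12.1' to a PyTorch wheel index URL."""
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"

print(pytorch_index_url("12.1"))  # https://download.pytorch.org/whl/cu121
```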

Flash-Attention Build

If your environment (including PyTorch, Python, and CUDA versions) matches one of the prebuilt wheels provided in the flash-attention 2.3.6 release, you can directly install flash-attention from the corresponding wheel. Otherwise, we recommend building flash-attention from source manually for best compatibility. The following script can be used to build flash-attention manually:

# flash-attn build
git clone https://github.com/Dao-AILab/flash-attention.git
# enter the cloned repository before checking out the pinned commit
cd flash-attention
git checkout 92dd570
git submodule sync --recursive
git submodule update --init --recursive
python setup.py install

# ops build
cd csrc
cd rotary && python setup.py install && cd ..
cd layer_norm && python setup.py install && cd ..
cd fused_dense_lib && python setup.py install && cd ..
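
Once the build finishes, a quick check like the one below can confirm that the installed flash-attention version is the one this README expects. This is a minimal sketch: it assumes the package is registered under the distribution name `flash-attn`, and the function name `check_flash_attn` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_FLASH_ATTN = "2.3.6"  # version named in this README

def check_flash_attn(expected: str = EXPECTED_FLASH_ATTN) -> str:
    """Return a short status message about the installed flash-attn version."""
    try:
        # Assumption: the distribution is registered as "flash-attn".
        installed = version("flash-attn")
    except PackageNotFoundError:
        return "flash-attn is not installed"
    # Ignore any local build suffix after "+" when comparing versions.
    if installed.split("+")[0] == expected:
        return f"flash-attn {installed} matches the expected version"
    return f"flash-attn {installed} found, but {expected} is expected"

print(check_flash_attn())
```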

Afterwards, you can clone this repository, and enjoy using MagiCodec!

🚀 Quick Start

Checkpoint Download

Pre-trained model checkpoints are available. Please use the following links to download the checkpoints:

Model Name             | Dataset    | Sample Rate | Token Rate
🤗 MagiCodec-50Hz-Base | Librilight | 16 kHz      | 50 Hz

Inference from raw wav

You can get discrete tokens and reconstructed waveform from raw wav using the following code:

from codec.generator import Generator
import torch
import torchaudio
torch.set_grad_enabled(False)

target_sr = 16000
token_hz = 50
model_path = "MagiCodec-50Hz-Base.ckpt"

model = Generator(
    sample_rate = target_sr,
    token_hz = token_hz,
)
state_dict = torch.load(model_path, map_location='cpu')
model.load_state_dict(state_dict, strict=False)

device = "cuda:0"
model = model.to(device)
model.eval()

def preprocess(path):
    x, sr = torchaudio.load(path)
    x = x.to(device)
    x = x.mean(dim=0, keepdim=True)
    x = torchaudio.functional.resample(x, sr, target_sr)
    return x[None, ...]

def infer(path):
    x = preprocess(path)
    orig_length = x.shape[-1]

    recon, codes, zq = model.infer(x)
    recon = recon[..., :orig_length]
    return recon, codes

recon, codes = infer("audio/1580-141083-0000.flac")
torchaudio.save("recon.wav", recon[0].cpu(), target_sr)
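
For orientation, the relationship between waveform length and token count at these settings can be sketched as follows. This is an approximation: the exact number of tokens depends on how `model.infer` pads its input, which this README does not specify (the trimming of `recon` to `orig_length` above suggests the output can be slightly longer than the input). The helper names are illustrative.

```python
import math

target_sr = 16000  # Hz, as in the snippet above
token_hz = 50      # tokens per second

# Each token covers this many input samples.
samples_per_token = target_sr // token_hz

def approx_num_tokens(num_samples: int) -> int:
    """Rough token count, assuming a trailing partial frame is padded up."""
    return math.ceil(num_samples / samples_per_token)

def approx_duration_s(num_tokens: int) -> float:
    """Audio duration in seconds covered by a given number of tokens."""
    return num_tokens / token_hz

print(samples_per_token)          # 320
print(approx_num_tokens(160000))  # 500 tokens for 10 s of 16 kHz audio
```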

Inference from codes

You can also get reconstructed waveform from discrete tokens using the following code:

def infer_from_code(codes, codebook):
    assert codes.ndim == 2   # (B, T)
    z_q = torch.nn.functional.embedding(codes, codebook)
    recon = model.decoder(z_q).float()
    return recon

with torch.autocast(
    device_type = "cuda",
    dtype = torch.bfloat16,
    enabled = True,
):
    codebook = model.quantizer.codebook_proj(model.quantizer.codebook.weight) 
    recon_from_code = infer_from_code(codes, codebook)

torchaudio.save("recon_from_code.wav", recon_from_code[0].cpu(), target_sr)

print((recon==recon_from_code).all().item())
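
The embedding step in `infer_from_code` is a plain row lookup: each token id selects one row of the projected codebook. A toy, framework-free sketch of that lookup is shown below; the codebook values and sizes are made up for illustration and are unrelated to the real model.

```python
# Toy codebook with K=3 entries of dimension D=2 (values are made up).
codebook = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
# Toy batch of token ids with shape (B=1, T=4).
codes = [[2, 0, 1, 2]]

def embed(codes, codebook):
    """Replace each token id with its codebook row: (B, T) -> (B, T, D)."""
    return [[codebook[i] for i in row] for row in codes]

z_q = embed(codes, codebook)
print(z_q[0][0])  # [2.0, 2.1]
```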

💡 Tips

  • For devices that do not support BF16, you can manually disable PyTorch's autocast (mixed-precision) context, for example by passing enabled=False to torch.autocast.
  • If you encounter any issues or have questions, please feel free to open an issue.

📝 Citation

If you find this repo helpful, please cite our work:

@misc{song2025magicodec,
      title={MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation}, 
      author={Yakun Song and Jiawei Chen and Xiaobin Zhuang and Chenpeng Du and Ziyang Ma and Jian Wu and Jian Cong and Dongya Jia and Zhuo Chen and Yuping Wang and Yuxuan Wang and Xie Chen},
      year={2025},
      eprint={2506.00385},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2506.00385}, 
}

📄 License

The code in this repository is released under the MIT license; see LICENSE for details.
