MagiCodec: Simple Masked Gaussian-Injected Audio Codec for High-Fidelity Reconstruction & Generation
MagiCodec is a single-layer, streaming codec model that delivers state-of-the-art audio quality and highly model-able discrete tokens.
This is the code for the MagiCodec neural audio codec presented in the paper MagiCodec: Simple Masked Gaussian-Injected Audio Codec for High-Fidelity Reconstruction and Generation [abs].
โจ Key Features
- Singleโlayer streaming design โ lightweight causal Transformer that drops straight into audioโnative LLMs with minimal latency.
- Low bitโrate & computeโefficient โ <= 850โฏbps and <=โฏ40โฏms lookโahead keep bandwidth and FLOPs tiny for onโdevice use.
- Rich acoustic and semantic information โ captures fineโgrained detail plus highโlevel semantics, achieving top scores in both waveform reconstruction and downstream tasks.
๐ง Samples
To get a quick sense of our codecโs performance, please check out the Sample Page. Comprehensive benchmarks covering a variety of baselines are available on this page.
๐ ๏ธ Installation
Our released code is based on a specific version (2.3.6) of FlashAttention.
Please note that different (including newer or older) versions of flash-attention may introduce changes to the C/Python interfaces called by our custom ops. Compatibility issues may arise if you use a different version of flash-attention.
We recommend creating a fresh conda environment for installation.
Env Setup
conda create -n MagiCodec python=3.9 -y
conda activate MagiCodec
Basic Requirements
change to your own cuda version.
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install packaging ninja cmake pybind11 pytorch_lightning transformers
conda install -c conda-forge 'ffmpeg<7'
Flash-Attention Build
If your environment (including PyTorch, Python, and CUDA versions) matches one of the prebuilt wheels provided in the flash-attention 2.3.6 release, you can directly install flash-attention from the corresponding wheel. Otherwise, we recommend building flash-attention from source manually for best compatibility. The following script can be used to build flash-attention manually:
# flash-attn build
git clone https://github.com/Dao-AILab/flash-attention.git
git checkout 92dd570
git submodule sync --recursive
git submodule update --init --recursive
cd flash-attention
python setup.py install
# ops build
cd csrc
cd rotary && python setup.py install && cd ..
cd layer_norm && python setup.py install && cd ..
cd fused_dense_lib && python setup.py install && cd ..
Afterwards, you can clone this repository, and enjoy using MagiCodec!
๐ Quick Start
Checkpoint Download
Pre-trained model checkpoints are available. Please use the following links to download the checkpoints:
Model Name | Dataset | Sample Rate | Token Rate |
---|---|---|---|
๐ค MagiCodec-50Hz-Base | Librilight | 16k Hz | 50 Hz |
Inference from raw wav
You can get discrete tokens and reconstructed waveform from raw wav using the following code:
from codec.generator import Generator
import torch
import torchaudio
torch.set_grad_enabled(False)
target_sr = 16000
token_hz = 50
model_path = "MagiCodec-50Hz-Base.ckpt"
model = Generator(
sample_rate = target_sr,
token_hz = token_hz,
)
state_dict = torch.load(model_path, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
device = f"cuda:0"
model = model.to(device)
model.eval()
def preprocess(path):
x, sr = torchaudio.load(path)
x = x.to(device)
x = x.mean(dim=0, keepdim=True)
x = torchaudio.functional.resample(x, sr, target_sr)
return x[None, ...]
def infer(path):
x = preprocess(path)
orig_length = x.shape[-1]
recon, codes, zq = model.infer(x)
recon = recon[..., :orig_length]
return recon, codes
recon, codes = infer("audio/1580-141083-0000.flac")
torchaudio.save("recon.wav", recon[0].cpu(), target_sr)
Inference from codes
You can also get reconstructed waveform from discrete tokens using the following code:
def infer_from_code(codes, codebook):
assert codes.ndim == 2 # (B, T)
z_q = torch.nn.functional.embedding(codes, codebook)
recon = model.decoder(z_q).float()
return recon
with torch.autocast(
device_type = "cuda",
dtype = torch.bfloat16,
enabled = True,
):
codebook = model.quantizer.codebook_proj(model.quantizer.codebook.weight)
recon_from_code = infer_from_code(codes, codebook)
torchaudio.save("recon_from_code.wav", recon_from_code[0].cpu(), target_sr)
print((recon==recon_from_code).all().item())
๐ก Tips
- For devices that do not support BF16, you can manually disable PyTorchโs mixed precision manager.
- If you encounter any issues or have questions, please feel free to open an issue.
๐ Citation
If you find this repo helpful, please cite our work:
@misc{song2025magicodec,
title={MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation},
author={Yakun Song and Jiawei Chen and Xiaobin Zhuang and Chenpeng Du and Ziyang Ma and Jian Wu and Jian Cong and Dongya Jia and Zhuo Chen and Yuping Wang and Yuxuan Wang and Xie Chen},
year={2025},
eprint={2506.00385},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2506.00385},
}
๐ License
The code in this repository is released under the MIT license, see LICENSE for details.