EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

Abstract: We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: this https URL.
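
For reference, the CFG-rescaling step can be summarized as follows. This is a minimal PyTorch sketch of the general recipe (rescaling the guided prediction so its standard deviation matches the conditional prediction's); the function name and the rescale factor are illustrative, and the exact formulation used in EzAudio may differ:

import torch

def rescaled_cfg(noise_cond, noise_uncond, guidance_scale, rescale=0.7):
    # standard classifier-free guidance
    noise_cfg = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    # rescale the guided prediction so its std matches the conditional prediction,
    # counteracting the fidelity loss that appears at high guidance scales
    dims = tuple(range(1, noise_cond.ndim))
    std_cond = noise_cond.std(dim=dims, keepdim=True)
    std_cfg = noise_cfg.std(dim=dims, keepdim=True)
    noise_rescaled = noise_cfg * (std_cond / std_cfg)
    # blend the rescaled and plain CFG predictions
    return rescale * noise_rescaled + (1.0 - rescale) * noise_cfg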

Official Page | arXiv | Hugging Face Models

🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands.

🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: EzAudio Space

🎮 EzAudio-ControlNet is available: EzAudio-ControlNet Space

Installation

Clone the repository:

git clone git@github.com:haidog-yaqub/EzAudio.git

Install the dependencies:

cd EzAudio
pip install -r requirements.txt

Download checkpoints (Optional): https://huggingface.co/OpenSound/EzAudio
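
Alternatively, the checkpoints can be fetched with the standard huggingface_hub client; a minimal sketch (the local_dir below is an assumption, point it wherever you keep model weights):

from huggingface_hub import snapshot_download

# download all EzAudio checkpoints from the Hugging Face Hub into ./ckpts
snapshot_download(repo_id="OpenSound/EzAudio", local_dir="ckpts")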

Usage

You can use the model with the following code:

from api.ezaudio import EzAudio
import torch
import soundfile as sf

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)

Training

Autoencoder

Refer to the VAE training section in our work SoloAudio.

T2A Diffusion Model

Prepare your data (see example in src/dataset/meta_example.csv), then run:

cd src
accelerate launch train.py
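
For multi-GPU or mixed-precision training, the usual Accelerate workflow applies; the flag values below are examples, not required settings:

# one-time interactive setup of hardware and precision options
accelerate config
# example: launch on 4 GPUs with bf16 mixed precision
accelerate launch --num_processes 4 --mixed_precision bf16 train.py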

Todo

  • Release Gradio Demo along with checkpoints EzAudio Space
  • Release ControlNet Demo along with checkpoints EzAudio ControlNet Space
  • Release inference code
  • Release training pipeline and dataset
  • Improve API and support automatic ckpts downloading
  • Release checkpoints for stage1 and stage2 [WIP]

Reference

If you find the code useful for your research, please consider citing:

@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}

Acknowledgement

Some code is borrowed from or inspired by: U-ViT, PixArt, Hunyuan-DiT, and Stable Audio.
