
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
Abstract: We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy that leverages recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are publicly released.
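The CFG rescaling idea in point (2) can be illustrated with a small sketch. This is not taken from the EzAudio code base: it assumes the commonly used standard-deviation-matching form of guidance rescaling, and the function name `cfg_with_rescale` and its default values are hypothetical choices for illustration only.

```python
import numpy as np

def cfg_with_rescale(cond, uncond, guidance_scale=5.0, rescale=0.75):
    """Classifier-free guidance followed by std-based rescaling (illustrative).

    cond / uncond: conditional and unconditional model outputs, shape (batch, ...).
    guidance_scale and rescale are assumed defaults, not values from the paper.
    """
    # Standard CFG: move the prediction away from the unconditional branch.
    guided = uncond + guidance_scale * (cond - uncond)

    # Per-sample standard deviation over all non-batch axes.
    axes = tuple(range(1, cond.ndim))
    std_cond = cond.std(axis=axes, keepdims=True)
    std_guided = guided.std(axis=axes, keepdims=True)

    # Rescale the guided output so its std matches the conditional output.
    rescaled = guided * (std_cond / (std_guided + 1e-8))

    # Blend the rescaled and plain guided outputs.
    return rescale * rescaled + (1.0 - rescale) * guided
```

With `rescale=0.0` this reduces to plain CFG, while `rescale=1.0` fully matches the per-sample standard deviation of the conditional prediction, which is the mechanism usually credited with limiting over-amplified outputs at high guidance scales.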
EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands.
Play with EzAudio for text-to-audio generation, editing, and inpainting: EzAudio Space
EzAudio-ControlNet is available: EzAudio-ControlNet Space
Installation
Clone the repository:
git clone git@github.com:haidog-yaqub/EzAudio.git
Install the dependencies:
cd EzAudio
pip install -r requirements.txt
Download checkpoints (optional): https://huggingface.co/OpenSound/EzAudio
Usage
You can use the model with the following code:
from api.ezaudio import EzAudio
import torch
import soundfile as sf
# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)
# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)
# audio inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
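The snippets above use the raw prompt as the output file name, which can fail if a prompt contains characters such as `/` or `:`. A minimal sketch of a sanitizing helper follows; `prompt_to_filename` is a hypothetical name and is not part of the EzAudio API.

```python
import re

def prompt_to_filename(prompt, suffix="", ext=".wav", max_len=80):
    """Turn a free-text prompt into a file-system-friendly name (hypothetical helper)."""
    # Replace any run of characters other than letters, digits, '-' or '_' with '_'.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", prompt).strip("_")
    # Truncate long prompts; fall back to a generic name if nothing usable is left.
    stem = stem[:max_len] or "audio"
    return f"{stem}{suffix}{ext}"
```

It could then be used as `sf.write(prompt_to_filename(prompt), audio, sr)`, or with `suffix="_edit"` for the inpainting output.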
Training
Autoencoder
Refer to the VAE training section of our related work, SoloAudio.
T2A Diffusion Model
Prepare your data (see the example in src/dataset/meta_example.csv), then run:
cd src
accelerate launch train.py
Todo
- Release Gradio Demo along with checkpoints EzAudio Space
- Release ControlNet Demo along with checkpoints EzAudio ControlNet Space
- Release inference code
- Release training pipeline and dataset
- Improve API and support automatic ckpts downloading
- Release checkpoints for stage1 and stage2 [WIP]
Reference
If you find the code useful for your research, please consider citing:
@article{hai2024ezaudio,
title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
journal={arXiv preprint arXiv:2409.10819},
year={2024}
}
Acknowledgement
Some code is borrowed from or inspired by: U-ViT, PixArt, Hunyuan-DiT, and Stable Audio.