# MusicInfuser
MusicInfuser adapts a text-to-video diffusion model to align with music, generating dance videos that follow both the input music and the text prompt.
## Requirements

We have tested on Python 3.10 with `torch>=2.4.1+cu118`, `torchaudio>=2.4.1+cu118`, and `torchvision>=0.19.1+cu118`. This repository requires a single A100 GPU for training and inference.
## Installation

```bash
# Clone the repository
git clone https://github.com/SusungHong/MusicInfuser
cd MusicInfuser

# Create and activate conda environment
conda create -n musicinfuser python=3.10
conda activate musicinfuser

# Install dependencies
pip install -r requirements.txt
pip install -e ./mochi --no-build-isolation

# Download model weights
python ./music_infuser/download_weights.py weights/
```
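After installation, a quick sanity check along these lines can confirm that the tested library versions and a CUDA-capable GPU (e.g., an A100) are visible. This snippet is just a convenience and not part of the repository:

```python
# Environment sanity check: prints installed versions and CUDA availability
import torch
import torchaudio
import torchvision

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```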
## Inference

To generate videos from music inputs:

```bash
python inference.py --input-file {MP3 or MP4 to extract audio from} \
                    --prompt {prompt} \
                    --num-frames {number of frames}
```
with the following arguments:

- `--input-file`: Input file (MP3 or MP4) to extract audio from.
- `--prompt`: Prompt for the dancer generation. More specific prompts generally produce better results, but greater specificity reduces the effect of the audio. Default: `"a professional female dancer dancing K-pop in an advanced dance setting in a studio with a white background, captured from a front view"`
- `--num-frames`: Number of frames to generate. Although originally trained with 73 frames, MusicInfuser can extrapolate to longer sequences. Default: `145`

Also consider:

- `--seed`: Random seed for generation. The resulting dance also depends on the random seed, so feel free to change it. Default: `None`
- `--cfg-scale`: Classifier-Free Guidance (CFG) scale for the text prompt. Default: `6.0`
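For example, the following generates a 145-frame dance video from an audio file with a fixed seed, the default prompt, and the default CFG scale (`song.mp3` is just a placeholder for your own input):

```bash
python inference.py --input-file song.mp3 \
                    --prompt "a professional female dancer dancing K-pop in an advanced dance setting in a studio with a white background, captured from a front view" \
                    --num-frames 145 \
                    --seed 42 \
                    --cfg-scale 6.0
```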
## Dataset

For the AIST dataset, please see the terms of use and download it at the AIST Dance Video Database.
## Training

To train the model on your own dataset:

- Preprocess your data:

  ```bash
  bash music_infuser/preprocess.bash -v {dataset path} -o {processed video output dir} -w {path to pretrained mochi} --num_frames {number of frames}
  ```

- Run training (a combined example follows the note below):

  ```bash
  bash music_infuser/run.bash -c music_infuser/configs/music_infuser.yaml -n 1
  ```

Note: The current implementation supports only single-GPU training, which requires approximately 80 GB of VRAM to train with 73-frame sequences.
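Putting the two steps together, a run might look like the following; every path here is a placeholder for your own dataset, output, and pretrained-Mochi locations:

```bash
# 1. Preprocess raw dance videos into 73-frame training clips
#    (dataset/, processed/, and weights/ are placeholder paths)
bash music_infuser/preprocess.bash \
    -v dataset/ \
    -o processed/ \
    -w weights/ \
    --num_frames 73

# 2. Launch single-GPU training with the provided config
bash music_infuser/run.bash -c music_infuser/configs/music_infuser.yaml -n 1
```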
## VLM Evaluation

For evaluating the model using Visual Language Models:

- Follow the instructions in `vlm_eval/README.md` to set up the VideoLLaMA2 evaluation framework.
- It is recommended to use a separate environment from MusicInfuser for the evaluation.
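For instance, the evaluation can live in its own conda environment (the environment name and Python version below are only suggestions; the actual dependencies come from `vlm_eval/README.md`):

```bash
# Keep VLM evaluation separate from the musicinfuser environment
conda create -n vlm-eval python=3.10
conda activate vlm-eval
# Then follow vlm_eval/README.md to install the VideoLLaMA2 dependencies
```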
Citation
@article{hong2025musicinfuser,
title={MusicInfuser: Making Video Diffusion Listen and Dance},
author={Hong, Susung and Kemelmacher-Shlizerman, Ira and Curless, Brian and Seitz, Steven M},
journal={arXiv preprint arXiv:2503.14505},
year={2025}
}
## Acknowledgements

This code builds upon awesome repositories such as Mochi and VideoLLaMA2. We thank the authors for open-sourcing their code and models, which made this work possible.