MusicInfuser

Project Page | Paper

MusicInfuser adapts a text-to-video diffusion model to align with music, generating dance videos according to the music and text prompts.

Requirements

This code has been tested with Python 3.10, torch>=2.4.1+cu118, torchaudio>=2.4.1+cu118, and torchvision>=0.19.1+cu118. A single A100 GPU is required for training and inference.
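Once the environment is set up (see Installation below), you can check the installed versions and CUDA availability with a one-liner:

python -c "import torch, torchaudio, torchvision; print(torch.__version__, torchaudio.__version__, torchvision.__version__, torch.cuda.is_available())"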

Installation

# Clone the repository
git clone https://github.com/SusungHong/MusicInfuser
cd MusicInfuser

# Create and activate conda environment
conda create -n musicinfuser python=3.10
conda activate musicinfuser

# Install dependencies
pip install -r requirements.txt
pip install -e ./mochi --no-build-isolation

# Download model weights
python ./music_infuser/download_weights.py weights/
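After the download finishes, it can help to sanity-check that the weights directory is populated. The exact filenames depend on the release, so this is only an illustrative check:

# Confirm the downloaded weights are present
ls -lh weights/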

Inference

To generate videos from music inputs:

python inference.py --input-file {MP3 or MP4 to extract audio from} \
                    --prompt {prompt} \
                    --num-frames {number of frames}

with the following arguments:

  • --input-file: Input file (MP3 or MP4) to extract audio from.
  • --prompt: Text prompt for the dancer generation. More specific prompts generally yield better results, but greater specificity reduces the influence of the audio. Default: "a professional female dancer dancing K-pop in an advanced dance setting in a studio with a white background, captured from a front view"
  • --num-frames: Number of frames to generate. While originally trained with 73 frames, MusicInfuser can extrapolate to longer sequences. Default: 145

Also consider the following optional arguments:

  • --seed: Random seed for generation. The resulting dance also depends on the random seed, so feel free to change it. Default: None
  • --cfg-scale: Classifier-Free Guidance (CFG) scale for the text prompt. Default: 6.0
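Putting it together, a full invocation might look like the following (the input path is a placeholder):

python inference.py --input-file ./examples/song.mp3 \
                    --prompt "a professional female dancer dancing K-pop in an advanced dance setting in a studio with a white background, captured from a front view" \
                    --num-frames 145 \
                    --seed 42 \
                    --cfg-scale 6.0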

Dataset

For the AIST dataset, please review the terms of use and download it from the AIST Dance Video Database.

Training

To train the model on your dataset:

  1. Preprocess your data:
bash music_infuser/preprocess.bash -v {dataset path} -o {processed video output dir} -w {path to pretrained mochi} --num_frames {number of frames}
  2. Run training:
bash music_infuser/run.bash -c music_infuser/configs/music_infuser.yaml -n 1
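As a concrete sketch, with illustrative placeholder paths (73 frames matches the original training setup noted below), the two steps might be run as:

# Preprocess raw dance videos into 73-frame training clips
bash music_infuser/preprocess.bash -v ./data/dance_videos -o ./data/processed -w ./weights --num_frames 73
# Launch single-GPU training with the provided config
bash music_infuser/run.bash -c music_infuser/configs/music_infuser.yaml -n 1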

Note: The current implementation only supports single-GPU training, which requires approximately 80GB of VRAM to train with 73-frame sequences.

VLM Evaluation

For evaluating the model using Vision-Language Models (VLMs):

  1. Follow the instructions in vlm_eval/README.md to set up the VideoLLaMA2 evaluation framework
  2. It is recommended to use a separate environment from MusicInfuser for the evaluation

Citation

@article{hong2025musicinfuser,
  title={MusicInfuser: Making Video Diffusion Listen and Dance},
  author={Hong, Susung and Kemelmacher-Shlizerman, Ira and Curless, Brian and Seitz, Steven M},
  journal={arXiv preprint arXiv:2503.14505},
  year={2025}
}

Acknowledgements

This code builds upon the following awesome repositories:

  • Mochi
  • VideoLLaMA2

We thank the authors for open-sourcing their code and models, which made this work possible.
