JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

We introduce JavisDiT, a novel, state-of-the-art Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG) from open-ended user prompts.

πŸ“° News

  • [2025.08.11] πŸ”₯ We released the data and code for JAVG evaluation. For more details, see eval/javisbench/README.md.
  • [2025.04.15] πŸ”₯ We released the data preparation and model training instructions. You can train JavisDiT with your own dataset!
  • [2025.04.07] πŸ”₯ We released the inference code and a preview model of JavisDiT-v0.1 at HuggingFace, which includes JavisDiT-v0.1-audio, JavisDiT-v0.1-prior, and JavisDiT-v0.1-jav (with a low-resolution version and a full-resolution version).
  • [2025.04.03] We released the repository of JavisDiT. Code, model, and data are coming soon.

πŸ‘‰ TODO

  • Release the data and evaluation code for JavisScore.
  • Derive a more efficient and powerful JAVG model.

Brief Introduction

JavisDiT addresses the key bottleneck of JAVG with Hierarchical Spatio-Temporal Prior Synchronization.

  • We introduce JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG) from open-ended user prompts.
  • We propose JavisBench, a new benchmark consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios.
  • We devise JavisScore, a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content.
  • We curate JavisEval, a dataset with 3,000 human-annotated samples to quantitatively evaluate the accuracy of synchronization estimation metrics.

We hope to set a new standard for the JAVG community. For more technical details, kindly refer to the original paper.

Citation

If you find JavisDiT useful in your project, please cite:

@misc{liu2025javisdit,
      title={JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization},
      author={Kai Liu and Wei Li and Lai Chen and Shengqiong Wu and Yanhao Zheng and Jiayi Ji and Fan Zhou and Rongxin Jiang and Jiebo Luo and Hao Fei and Tat-Seng Chua},
      year={2025},
      eprint={2503.23377},
      archivePrefix={arXiv},
}