arxiv:2506.02444

SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

Published on Jun 3

Submitted by levondang on Jun 6


Authors:

Lingwei Dang ,

Abstract

A framework combining visual priors and dynamic constraints within a synchronized diffusion process generates HOI video and motion simultaneously, enhancing video-motion consistency and generalization.

AI-generated summary

Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches rely heavily on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature alignment, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://github.com/Droliven/SViMo_project.
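To make the "tri-modal adaptive modulation" and "full-attention" ideas in the abstract more concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the abstract gives no layer details, so this assumes an adaLN-style scale-and-shift per modality followed by single-head self-attention over the concatenated token sequence. All names (`modulate`, `full_attention`, `streams`), the token counts, and the random scale/shift vectors are hypothetical; in a real model the scale and shift would typically be predicted from a conditioning embedding, and attention would use learned query/key/value projections.

```python
import math
import random


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def modulate(tokens, scale, shift):
    """adaLN-style modulation (an assumption): norm(x) * (1 + scale) + shift."""
    out = []
    for tok in tokens:
        normed = layer_norm(tok)
        out.append([n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)])
    return out


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(r - peak) for r in row]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention(tokens):
    """Single-head self-attention with identity projections over all tokens,
    so every token attends to tokens of every modality."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    fused = [
        [sum(w * tok[j] for w, tok in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return fused, weights


random.seed(0)
DIM = 8  # hypothetical feature width shared by all three modalities


def random_tokens(count):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


# Hypothetical token streams standing in for text, video, and motion features.
streams = {
    "text": random_tokens(4),
    "video": random_tokens(6),
    "motion": random_tokens(5),
}

joint = []
for name, tokens in streams.items():
    # Each modality gets its own scale/shift (random placeholders here).
    scale = [random.gauss(0.0, 0.1) for _ in range(DIM)]
    shift = [random.gauss(0.0, 0.1) for _ in range(DIM)]
    joint.extend(modulate(tokens, scale, shift))

fused, weights = full_attention(joint)
print(len(fused), len(fused[0]))
```

Because the three streams are concatenated before attention, each row of `weights` spans all 15 tokens, which is one simple way to capture both intra-modal and inter-modal dependencies in a single pass.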


Community

levondang

Paper author Paper submitter 1 day ago

TL;DR: A novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process for joint generation of video and motion in Hand-Object Interaction (HOI) scenarios.
Project page at https://github.com/Droliven/SViMo_project.
Video demonstration: https://www.youtube.com/watch?v=pVkntn-8KHo.


