BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
Abstract
BindWeave, a unified MLLM-DiT framework, enhances subject-consistent video generation by coupling deep cross-modal reasoning with a diffusion transformer, achieving superior performance on the OpenS2V benchmark.
Diffusion Transformers have shown remarkable ability to generate high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios, from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.
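The abstract gives no implementation details, but the conditioning idea it describes can be illustrated concretely. The PyTorch sketch below shows one plausible way that subject-aware hidden states produced by a multimodal LLM could condition a diffusion-transformer block via cross-attention; all module names, dimensions, and the random placeholder inputs are illustrative assumptions, not BindWeave's actual architecture.

```python
# Minimal sketch (assumed design, not the paper's code): a DiT block whose video
# tokens cross-attend to "subject-aware" hidden states from a multimodal LLM.
import torch
import torch.nn as nn


class SubjectAwareCrossAttention(nn.Module):
    """Cross-attention from video latent tokens to MLLM hidden states."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, kdim=cond_dim, vdim=cond_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, latent_tokens, cond_tokens):
        # latent_tokens: (B, N_video, dim)  patchified noisy video latents
        # cond_tokens:   (B, N_cond, cond_dim)  subject-aware MLLM hidden states
        q = self.norm(latent_tokens)
        out, _ = self.attn(q, cond_tokens, cond_tokens)
        return latent_tokens + out  # residual connection


class DiTBlock(nn.Module):
    """One transformer block: self-attention over video tokens,
    cross-attention to the MLLM conditioning, then an MLP."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = SubjectAwareCrossAttention(dim, cond_dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, cond):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = self.cross_attn(x, cond)
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    B, N_video, N_cond = 2, 256, 77
    dim, cond_dim = 512, 1024  # cond_dim stands in for the MLLM hidden size
    # In the described pipeline, cond would come from a pretrained MLLM reading the
    # prompt plus reference subject images; random tensors stand in here.
    video_tokens = torch.randn(B, N_video, dim)
    mllm_hidden = torch.randn(B, N_cond, cond_dim)
    block = DiTBlock(dim, cond_dim)
    out = block(video_tokens, mllm_hidden)
    print(out.shape)  # torch.Size([2, 256, 512])
```

The key point the sketch captures is that the conditioning signal is a sequence of hidden states from the MLLM's cross-modal reasoning rather than a plain text embedding; how BindWeave actually injects that signal into the DiT is not specified in the abstract.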
Community
The following similar papers were recommended by the Semantic Scholar API:
- SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation (2025)
- PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation (2025)
- UniVid: The Open-Source Unified Video Model (2025)
- Lynx: Towards High-Fidelity Personalized Video Generation (2025)
- Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis (2025)
- RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer (2025)
- HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning (2025)