Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
Abstract
A diffusion-based framework generates aligned novel views of images and geometry using warping-and-inpainting with cross-modal attention instillation and proximity-based mesh conditioning, achieving high-fidelity synthesis and 3D completion.
We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention instillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating both geometrically robust image synthesis and well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating across the point cloud and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. The project page is available at https://cvlab-kaist.github.io/MoAI.
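The idea of injecting image-branch attention maps into the geometry branch can be illustrated with a toy sketch. This is not the authors' implementation: the token values, the function names (`attention_map`, `apply_attention`), and the single-head, pure-Python formulation are all assumptions made for illustration. The only point it shows is that the geometry branch reuses the attention map computed from image queries and keys, so both modalities aggregate information from the same reference locations.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Attention map softmax(Q K^T / sqrt(d)); each row sums to 1."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]

def apply_attention(attn, values):
    """Aggregate value vectors using an already computed attention map."""
    dim = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(dim)]
        for row in attn
    ]

# Hypothetical toy tokens: 2 target-view queries, 3 reference-view keys.
image_q = [[1.0, 0.0], [0.0, 1.0]]
image_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_v = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.5, 0.5, 0.5]]  # e.g. colour features
geometry_v = [[0.5], [2.0], [1.0]]                              # e.g. depth features

# The map is computed once, from the image branch only ...
attn = attention_map(image_q, image_k)
image_out = apply_attention(attn, image_v)
# ... and reused in the geometry branch instead of a geometry-specific map.
geometry_out = apply_attention(attn, geometry_v)
```

Because both outputs are weighted by the same map, each target token draws its colour and its geometry from the same reference tokens, which is the alignment property the abstract describes.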