Abstract
MV-RAG enhances text-to-3D generation by retrieving relevant 2D images and conditioning a multiview diffusion model on them, improving consistency and accuracy, especially for out-of-domain concepts.
Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail on out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To address this, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
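The held-out view prediction objective can be pictured as a standard denoising loss applied to a single view that is withheld from the conditioning set. The sketch below is a minimal illustration under assumed interfaces: the function name, the model call signature, and the `noise_scheduler` methods (`add_noise`, `num_timesteps`) are hypothetical placeholders, not the authors' released code.

```python
# Hypothetical sketch of the held-out view prediction objective described above.
# All names and call signatures are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def held_out_view_loss(model, retrieved_views, text_tokens, noise_scheduler):
    """Predict one held-out retrieved view conditioned on the remaining views.

    retrieved_views: (B, K, C, H, W) real-world 2D images of the same concept.
    """
    B, K = retrieved_views.shape[:2]
    device = retrieved_views.device
    held_out_idx = torch.randint(0, K, (B,), device=device)        # one held-out view per sample
    target = retrieved_views[torch.arange(B, device=device), held_out_idx]  # (B, C, H, W)

    # The remaining K-1 views serve as the retrieval conditioning.
    mask = torch.ones(B, K, dtype=torch.bool, device=device)
    mask[torch.arange(B, device=device), held_out_idx] = False
    context = retrieved_views[mask].view(B, K - 1, *retrieved_views.shape[2:])

    # Standard denoising objective, applied only to the held-out view.
    noise = torch.randn_like(target)
    t = torch.randint(0, noise_scheduler.num_timesteps, (B,), device=device)
    noisy_target = noise_scheduler.add_noise(target, noise, t)
    pred_noise = model(noisy_target, t, text_tokens, image_cond=context)
    return F.mse_loss(pred_noise, noise)
```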
Community
MV-RAG brings the strengths of retrieval-augmented generation (RAG) to 3D, addressing challenges such as out-of-domain concepts (e.g., ‘Bolognese dog’) and concepts that emerged after training (e.g., ‘Labubu doll’).
MV-RAG advances multiview generation by combining a pretrained multiview model’s internal knowledge with external visual cues retrieved from a large image database. At inference, the retrieved 2D images are encoded into tokens using an image encoder followed by a learned resampler. Within the multiview diffusion model, 3D self-attention layers enforce consistency across the generated views. Each cross-attention layer then operates in two parallel branches: one conditioned on text tokens and the other on retrieved image tokens. Their outputs are fused using a fusion coefficient predicted by the Prior-Guided Attention module.
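A minimal sketch of the dual-branch cross-attention with a predicted fusion coefficient, assuming PyTorch-style modules; the class name, gating layer, and argument shapes are simplifying assumptions, not the actual implementation of the Prior-Guided Attention module.

```python
# Illustrative sketch of the parallel text / retrieved-image cross-attention
# branches fused by a predicted coefficient, as described above. Names and the
# simple sigmoid gate are assumptions for clarity, not the paper's code.
import torch
import torch.nn as nn

class DualBranchCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stand-in for the Prior-Guided Attention module: predicts a per-token
        # fusion coefficient in [0, 1] from the current hidden states.
        self.fusion_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, hidden, text_tokens, retrieved_tokens):
        # hidden:           (B, N, dim) latent tokens from the 3D self-attention block
        # text_tokens:      (B, T, dim) encoded prompt tokens
        # retrieved_tokens: (B, R, dim) resampled tokens from the retrieved images
        text_out, _ = self.text_attn(hidden, text_tokens, text_tokens)
        image_out, _ = self.image_attn(hidden, retrieved_tokens, retrieved_tokens)
        alpha = self.fusion_gate(hidden)            # (B, N, 1) fusion coefficient
        return alpha * image_out + (1 - alpha) * text_out
```

Intuitively, the coefficient lets the model lean on retrieved visual evidence for concepts the pretrained prior handles poorly, while falling back on the text-conditioned branch elsewhere.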
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GSFix3D: Diffusion-Guided Repair of Novel Views in Gaussian Splatting (2025)
- DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing (2025)
- SeqTex: Generate Mesh Textures in Video Sequence (2025)
- Stable-Hair v2: Real-World Hair Transfer via Multiple-View Diffusion Model (2025)
- Align 3D Representation and Text Embedding for 3D Content Personalization (2025)
- RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution (2025)
- Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization (2025)