arxiv:2508.16577

MV-RAG: Retrieval Augmented Multiview Diffusion

Published on Aug 22
· Submitted by yosepyossi on Aug 26
#3 Paper of the day

Abstract

MV-RAG enhances text-to-3D generation by retrieving relevant 2D images and conditioning a multiview diffusion model on them, improving consistency and accuracy, especially for out-of-domain concepts.

AI-generated summary

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail on out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To address this, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
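
The hybrid strategy alternates between two data sources: structured multiview renders, where target views are reconstructed conditioned on augmented views that simulate retrieval variance, and sets of retrieved real-world 2D images, where one image is held out and predicted from the rest. The sketch below is a minimal, self-contained illustration of that alternation under these assumptions; the toy denoiser, `augment_views`, and the batch layout are hypothetical stand-ins, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultiviewDenoiser(nn.Module):
    """Stand-in for the retrieval-conditioned multiview diffusion model."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, noisy_views, cond_views):
        # Condition each noisy view on a pooled summary of the conditioning images.
        cond = cond_views.mean(dim=1, keepdim=True).expand_as(noisy_views)
        b, v, c, h, w = noisy_views.shape
        x = torch.cat([noisy_views, cond], dim=2).reshape(b * v, 2 * c, h, w)
        return self.net(x).reshape(b, v, c, h, w)

def augment_views(views):
    """Perturb conditioning views to simulate retrieval variance (toy jitter)."""
    return views + 0.1 * torch.randn_like(views)

def hybrid_step(model, batch, mode):
    if mode == "multiview":
        # Structured multiview data: reconstruct the views conditioned on
        # augmented views (view-specific reconstruction).
        targets = batch["views"]                      # (B, V, C, H, W)
        cond = augment_views(targets)
    else:
        # Retrieved real-world 2D images: predict a held-out image from the
        # remaining ones, encouraging 3D-consistent behavior from 2D data.
        images = batch["retrieved_images"]            # (B, K, C, H, W)
        targets, cond = images[:, :1], images[:, 1:]
    noise = torch.randn_like(targets)
    pred = model(targets + noise, cond)               # toy denoising objective
    return F.mse_loss(pred, noise)

model = ToyMultiviewDenoiser()
mv_batch = {"views": torch.randn(2, 4, 3, 32, 32)}
rag_batch = {"retrieved_images": torch.randn(2, 4, 3, 32, 32)}
print(hybrid_step(model, mv_batch, "multiview").item())
print(hybrid_step(model, rag_batch, "retrieved_2d").item())
```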

Community

Paper author Paper submitter

teaser2.png
MV-RAG extends the strengths of RAG by addressing challenges such as out-of-domain generations (e.g., ‘Bolognese dog’) and emerging concepts introduced after training (e.g., ‘Labubu doll’).

teaser.jpg

MV-RAG advances multiview generation by combining a pretrained multiview model’s internal knowledge with external visual cues retrieved from a large image database. At inference, the retrieved 2D images are encoded into tokens using an image encoder followed by a learned resampler. Within the multiview diffusion model, 3D self-attention layers enforce consistency across the generated views. Each cross-attention layer then operates in two parallel branches: one conditioned on text tokens and the other on retrieved image tokens. Their outputs are fused using a fusion coefficient predicted by the Prior-Guided Attention module.
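
A rough sketch of that dual-branch cross-attention is given below, assuming the fusion coefficient is predicted per query token from its features and applied as a convex weight between the two branches; `PriorGuidedCrossAttention` and its gating layer are hypothetical stand-ins for the paper's Prior-Guided Attention module, not the released implementation.

```python
import torch
import torch.nn as nn

class PriorGuidedCrossAttention(nn.Module):
    """Toy dual-branch cross-attention with a predicted fusion coefficient."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        # Parallel branches: one attends to text tokens, one to the tokens
        # produced from the retrieved images (encoder + resampler output).
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stand-in for the Prior-Guided Attention module: a per-token gate
        # in [0, 1] predicted from the query features.
        self.fusion_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x, text_tokens, image_tokens):
        text_out, _ = self.text_attn(x, text_tokens, text_tokens)
        image_out, _ = self.image_attn(x, image_tokens, image_tokens)
        w = self.fusion_gate(x)                      # (B, N, 1) fusion coefficient
        return x + w * text_out + (1.0 - w) * image_out

layer = PriorGuidedCrossAttention()
latent = torch.randn(2, 64, 320)    # multiview latent tokens (queries)
text = torch.randn(2, 77, 320)      # text-encoder tokens
images = torch.randn(2, 16, 320)    # resampled retrieved-image tokens
print(layer(latent, text, images).shape)   # torch.Size([2, 64, 320])
```

Predicting the coefficient per token, as assumed here, is one plausible way to let the model lean on retrieved images only where the pretrained text prior is weak, which fits the paper's focus on OOD and rare concepts.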

