Abstract
MV-RAG enhances text-to-3D generation by retrieving relevant 2D images and conditioning a multiview diffusion model on them, improving consistency and accuracy, especially for out-of-domain concepts.
Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail on out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To address this, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
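The held-out view prediction objective can be pictured as a standard denoising loss applied to a single view that is withheld from the conditioning set. The sketch below is a minimal illustration under assumed interfaces: the function name, the model call signature, and the `noise_scheduler` methods (`add_noise`, `num_timesteps`) are hypothetical placeholders, not the authors' released code.

```python
# Hypothetical sketch of the held-out view prediction objective described above.
# All names and call signatures are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def held_out_view_loss(model, retrieved_views, text_tokens, noise_scheduler):
    """Predict one held-out retrieved view conditioned on the remaining views.

    retrieved_views: (B, K, C, H, W) real-world 2D images of the same concept.
    """
    B, K = retrieved_views.shape[:2]
    device = retrieved_views.device
    held_out_idx = torch.randint(0, K, (B,), device=device)        # one held-out view per sample
    target = retrieved_views[torch.arange(B, device=device), held_out_idx]  # (B, C, H, W)

    # The remaining K-1 views serve as the retrieval conditioning.
    mask = torch.ones(B, K, dtype=torch.bool, device=device)
    mask[torch.arange(B, device=device), held_out_idx] = False
    context = retrieved_views[mask].view(B, K - 1, *retrieved_views.shape[2:])

    # Standard denoising objective, applied only to the held-out view.
    noise = torch.randn_like(target)
    t = torch.randint(0, noise_scheduler.num_timesteps, (B,), device=device)
    noisy_target = noise_scheduler.add_noise(target, noise, t)
    pred_noise = model(noisy_target, t, text_tokens, image_cond=context)
    return F.mse_loss(pred_noise, noise)
```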
Community
MV-RAG brings the strengths of retrieval-augmented generation (RAG) to 3D, addressing challenges such as out-of-domain concepts (e.g., ‘Bolognese dog’) and concepts that emerged after training (e.g., ‘Labubu doll’).
MV-RAG advances multiview generation by combining a pretrained multiview model’s internal knowledge with external visual cues retrieved from a large image database. At inference, the retrieved 2D images are encoded into tokens using an image encoder followed by a learned resampler. Within the multiview diffusion model, 3D self-attention layers enforce consistency across the generated views. Each cross-attention layer then operates in two parallel branches: one conditioned on text tokens and the other on retrieved image tokens. Their outputs are fused using a fusion coefficient predicted by the Prior-Guided Attention module.
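A minimal sketch of the dual-branch cross-attention with a predicted fusion coefficient, assuming PyTorch-style modules; the class name, gating layer, and argument shapes are simplifying assumptions, not the actual implementation of the Prior-Guided Attention module.

```python
# Illustrative sketch of the parallel text / retrieved-image cross-attention
# branches fused by a predicted coefficient, as described above. Names and the
# simple sigmoid gate are assumptions for clarity, not the paper's code.
import torch
import torch.nn as nn

class DualBranchCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stand-in for the Prior-Guided Attention module: predicts a per-token
        # fusion coefficient in [0, 1] from the current hidden states.
        self.fusion_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, hidden, text_tokens, retrieved_tokens):
        # hidden:           (B, N, dim) latent tokens from the 3D self-attention block
        # text_tokens:      (B, T, dim) encoded prompt tokens
        # retrieved_tokens: (B, R, dim) resampled tokens from the retrieved images
        text_out, _ = self.text_attn(hidden, text_tokens, text_tokens)
        image_out, _ = self.image_attn(hidden, retrieved_tokens, retrieved_tokens)
        alpha = self.fusion_gate(hidden)            # (B, N, 1) fusion coefficient
        return alpha * image_out + (1 - alpha) * text_out
```

Intuitively, the coefficient lets the model lean on retrieved visual evidence for concepts the pretrained prior handles poorly, while falling back on the text-conditioned branch elsewhere.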
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GSFix3D: Diffusion-Guided Repair of Novel Views in Gaussian Splatting (2025)
- DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing (2025)
- SeqTex: Generate Mesh Textures in Video Sequence (2025)
- Stable-Hair v2: Real-World Hair Transfer via Multiple-View Diffusion Model (2025)
- Align 3D Representation and Text Embedding for 3D Content Personalization (2025)
- RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution (2025)
- Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization (2025)