MV-RAG: Retrieval Augmented Multiview Diffusion
Episode 1094 · 20:32

🤗 Upvotes: 31 | cs.CV, cs.AI

Authors:
Yosef Dayani, Omer Benishu, Sagie Benaim

Title:
MV-RAG: Retrieval Augmented Multiview Diffusion

arXiv:
http://arxiv.org/abs/2508.16577v1

Abstract:
Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
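The held-out view prediction objective is the most distinctive piece of the training recipe described above: hide one of the retrieved views from the conditioning set and train the model to denoise it from the remaining views. Below is a minimal PyTorch sketch of that idea. The ToyMultiviewDenoiser class, the tensor shapes, and the single-timestep noising are illustrative assumptions for this sketch, not the paper's actual architecture or diffusion schedule.

```python
import torch
import torch.nn as nn

class ToyMultiviewDenoiser(nn.Module):
    """Stand-in for a retrieval-conditioned multiview diffusion backbone.

    Takes V noisy views plus V conditioning images and predicts the noise
    for each view. This tiny conv net only mirrors the input/output shapes;
    MV-RAG's real model is a full multiview diffusion network.
    """
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels * 2, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, noisy_views, cond_views):
        # noisy_views, cond_views: (B, V, C, H, W)
        b, v, c, h, w = noisy_views.shape
        x = torch.cat([noisy_views, cond_views], dim=2).view(b * v, 2 * c, h, w)
        return self.net(x).view(b, v, c, h, w)

def held_out_view_loss(model, views, noise_scale=1.0):
    """Held-out view prediction on a set of retrieved 2D images (sketch).

    One view is hidden from the conditioning set; the model must predict
    it from the remaining views, forcing it to infer 3D consistency from
    unstructured 2D data, as the abstract describes.
    """
    b, v, c, h, w = views.shape
    held_out = torch.randint(0, v, (b,))  # index of the hidden view per sample

    # Zero out the held-out view in the conditioning set.
    cond = views.clone()
    cond[torch.arange(b), held_out] = 0.0

    # Single-timestep, DDPM-style noising of the target views (simplified).
    noise = torch.randn_like(views) * noise_scale
    noisy = views + noise

    pred = model(noisy, cond)

    # Supervise only the held-out view: its noise must be predicted from
    # the other views alone.
    pred_held = pred[torch.arange(b), held_out]
    noise_held = noise[torch.arange(b), held_out]
    return nn.functional.mse_loss(pred_held, noise_held)

if __name__ == "__main__":
    model = ToyMultiviewDenoiser()
    views = torch.randn(2, 4, 3, 32, 32)  # 2 samples, 4 retrieved views each
    loss = held_out_view_loss(model, views)
    loss.backward()
    print(f"held-out view loss: {loss.item():.4f}")
```

Masking the target view in the conditioning set is what makes the objective non-trivial: the model can only succeed by learning how the remaining views constrain the hidden one, which is how 3D consistency can be learned from diverse in-the-wild 2D collections rather than structured multiview data.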

