Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Abstract
Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, enhancing factual grounding and keeping outputs up to date. Recent advances in multimodal learning have led to Multimodal RAG, which incorporates multiple modalities such as text, images, audio, and video to enhance generated outputs. However, cross-modal alignment and reasoning introduce unique challenges for Multimodal RAG, distinguishing it from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and explore the diverse range of Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal, dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
Community
We are thrilled to introduce our comprehensive Multimodal Retrieval-Augmented Generation (Multimodal RAG) survey. If you’re curious about how dynamic, multimodal external knowledge can be leveraged to overcome the hallucination and outdated knowledge challenges in large language models, this paper is for you.
Our survey dives deep into:
Task Formulation, Datasets, Benchmarks, Evaluation: We cover everything from the foundational tasks and datasets to benchmarks and evaluation methods shaping multimodal AI's future.
Innovative Methodologies: Explore state-of-the-art techniques in retrieval, fusion, augmentation, and generation, along with detailed discussions on training strategies and loss functions (a minimal pipeline sketch follows this list).
Structured Taxonomy: Our work introduces a precise taxonomy (see attached Figure) that categorizes current models by their primary contributions, highlighting both methodological advancements and emerging trends.
Research Trends & Future Directions: We identify gaps and opportunities that are ripe for exploration and provide actionable recommendations to guide future research in this rapidly evolving field.
Open Resources: To support ongoing research, we are making key datasets, benchmarks, related papers, and innovations publicly available. Please visit the GitHub repository for this survey; we plan to update it regularly as new papers and advances appear in this domain.
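To make the retrieval, fusion, augmentation, and generation loop concrete, here is a minimal, self-contained Python sketch. The `Document` class, the `embed` stand-in encoder, and the `fuse_and_generate` stub are hypothetical illustrations, not components from the survey or from any particular library; a real system would use a shared multimodal encoder (e.g., a CLIP-style model), a vector index, and an actual (multimodal) LLM call in their place.

```python
# Minimal sketch of a multimodal RAG loop (hypothetical components):
# 1) embed a text query and multimodal documents into a shared space,
# 2) retrieve the top-k documents by cosine similarity,
# 3) fuse the retrieved evidence into an augmented prompt,
# 4) hand the prompt to a generator (stubbed out here).
from dataclasses import dataclass
import numpy as np


@dataclass
class Document:
    content: str           # caption, transcript, or OCR text standing in for the raw modality
    modality: str           # e.g. "text", "image", "audio", "video"
    embedding: np.ndarray   # vector in a shared multimodal embedding space


def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: deterministic per input, purely illustrative.
    A real pipeline would use a multimodal encoder such as a CLIP-style model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)


def retrieve(query_vec: np.ndarray, corpus: list[Document], k: int = 2) -> list[Document]:
    """Rank documents by cosine similarity (embeddings are unit-normalized)."""
    scores = [float(query_vec @ doc.embedding) for doc in corpus]
    order = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in order]


def fuse_and_generate(query: str, retrieved: list[Document]) -> str:
    """Fuse retrieved multimodal evidence into the prompt; the LLM call is stubbed."""
    context = "\n".join(f"[{d.modality}] {d.content}" for d in retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return prompt  # a real pipeline would pass this prompt to a (multimodal) LLM


corpus = [
    Document("A diagram of the RAG pipeline.", "image", embed("rag pipeline diagram")),
    Document("Transcript: the model retrieves evidence before answering.", "audio", embed("retrieval transcript")),
    Document("RAG grounds generation in retrieved external knowledge.", "text", embed("rag definition")),
]
query = "How does RAG reduce hallucinations?"
print(fuse_and_generate(query, retrieve(embed(query), corpus)))
```

Running the script prints the fused prompt that would be handed to a generator; swapping in real encoders, a vector index, and an LLM call turns this sketch into a working pipeline.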
We invite you to read and share our work as we collectively push the boundaries of what AI can achieve. Let’s spark a conversation and drive the next wave of innovation together!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs (2024)
- MRAMG-Bench: A BeyondText Benchmark for Multimodal Retrieval-Augmented Multimodal Generation (2025)
- VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos (2025)
- AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed articles (2024)
- ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models (2025)
- Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey (2025)
- Large Multimodal Models for Low-Resource Languages: A Survey (2025)