arxiv:2502.08826

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Published on Feb 12
· Submitted by aboots on Feb 18

Abstract

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge because they rely on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, improving factual grounding and keeping responses up to date. Recent advances in multimodal learning have led to Multimodal RAG, which incorporates multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and explore the diverse scenarios in which Multimodal RAG is applied. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal, dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
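To make the retrieve-augment-generate loop described in the abstract concrete, here is a minimal, purely illustrative sketch over a multimodal corpus. It is not taken from the paper: `embed` and `generate` are placeholder stubs standing in for a real multimodal encoder and an (M)LLM, and the fusion step is simple prompt concatenation.

```python
# Illustrative sketch of a multimodal RAG loop (not from the survey).
# The encoder and generator below are placeholders; any CLIP-style
# multimodal embedding model and any LLM could fill these roles.
from dataclasses import dataclass
import numpy as np

@dataclass
class Document:
    content: str            # text, or a caption/transcript standing in for an image/audio/video item
    modality: str            # "text" | "image" | "audio" | "video"
    embedding: np.ndarray    # vector in a shared multimodal embedding space

def embed(query: str) -> np.ndarray:
    """Placeholder encoder: replace with a real multimodal embedding model."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def retrieve(query_vec: np.ndarray, corpus: list[Document], k: int = 3) -> list[Document]:
    """Rank documents by cosine similarity to the query embedding."""
    ranked = sorted(corpus, key=lambda d: float(query_vec @ d.embedding), reverse=True)
    return ranked[:k]

def augment(query: str, docs: list[Document]) -> str:
    """Fuse retrieved evidence into the prompt (simple concatenation here)."""
    context = "\n".join(f"[{d.modality}] {d.content}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context above."

def generate(prompt: str) -> str:
    """Placeholder generator: replace with a call to an (M)LLM."""
    return f"<model output conditioned on {len(prompt)} prompt characters>"

if __name__ == "__main__":
    corpus = [Document(f"item {i}", "text", embed(f"item {i}")) for i in range(10)]
    query = "What does the figure show?"
    print(generate(augment(query, retrieve(embed(query), corpus))))
```

In a real system the retriever would operate over a shared embedding space produced by a jointly trained multimodal encoder, and the fusion/augmentation step is where much of the design variety surveyed in the paper lives.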

Community

Paper author · Paper submitter

We are thrilled to introduce our comprehensive Multimodal Retrieval-Augmented Generation (Multimodal RAG) survey. If you’re curious about how dynamic, multimodal external knowledge can be leveraged to overcome the hallucination and outdated knowledge challenges in large language models, this paper is for you.

Our survey dives deep into:

  • Task Formulation, Datasets, Benchmarks, Evaluation: We cover everything from the foundational tasks and datasets to benchmarks and evaluation methods shaping multimodal AI's future.

  • Innovative Methodologies: Explore state-of-the-art techniques in retrieval, fusion, augmentation, and generation, along with detailed discussions of training strategies and loss functions (an illustrative sketch of one such alignment objective follows this list).

  • Structured Taxonomy: Our work introduces a precise taxonomy (see attached Figure) that categorizes current models by their primary contributions, highlighting both methodological advancements and emerging trends.

  • Research Trends & Future Directions: We identify the gaps and opportunities that are ripe for exploration, providing actionable recommendations to guide future research in this rapidly evolving field.

  • Open Resources: To support ongoing research, we are making key datasets, benchmarks, related papers, and innovations publicly available in the GitHub repository for this survey, which we plan to update regularly with new papers and advances in the field.
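As a concrete (and deliberately simplified) illustration of the kind of alignment objective the "loss functions" bullet refers to, the sketch below computes a symmetric InfoNCE-style contrastive loss over a batch of paired image and text embeddings. This is our own toy example rather than code from the survey; the batch size, embedding dimension, and temperature are arbitrary choices.

```python
# Toy symmetric InfoNCE-style contrastive loss for cross-modal alignment
# (illustrative only; not taken from the survey).
import numpy as np

def info_nce_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of paired (image, text) embeddings."""
    # L2-normalise so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature              # (B, B) similarity matrix
    labels = np.arange(len(logits))                 # matching pairs lie on the diagonal
    # Cross-entropy in both directions (image->text and text->image).
    log_softmax = lambda x: x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    loss_i2t = -log_softmax(logits)[labels, labels].mean()
    loss_t2i = -log_softmax(logits.T)[labels, labels].mean()
    return float((loss_i2t + loss_t2i) / 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim = 8, 256
    print(info_nce_loss(rng.normal(size=(batch, dim)), rng.normal(size=(batch, dim))))
```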

We invite you to read and share our work as we collectively push the boundaries of what AI can achieve. Let’s spark a conversation and drive the next wave of innovation together!

arXiv: https://arxiv.org/abs/2502.08826 · GitHub: https://github.com/llm-lab-org/Multimodal-RAG-Survey

