Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Abstract
Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, enhancing factual grounding and keeping outputs up to date. Recent advances in multimodal learning have led to Multimodal RAG, which incorporates multiple modalities such as text, images, audio, and video to enhance generated outputs. However, cross-modal alignment and reasoning introduce unique challenges for Multimodal RAG, distinguishing it from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and explore the diverse range of Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal, dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
Community
We are thrilled to introduce our comprehensive Multimodal Retrieval-Augmented Generation (Multimodal RAG) survey. If you’re curious about how dynamic, multimodal external knowledge can be leveraged to overcome the hallucination and outdated knowledge challenges in large language models, this paper is for you.
Our survey dives deep into:
Task Formulation, Datasets, Benchmarks, Evaluation: We cover everything from the foundational tasks and datasets to benchmarks and evaluation methods shaping multimodal AI's future.
Innovative Methodologies: Explore state-of-the-art techniques in retrieval, fusion, augmentation, and generation, along with detailed discussions on training strategies and loss functions (a minimal pipeline sketch follows this list).
Structured Taxonomy: Our work introduces a precise taxonomy (see attached Figure) that categorizes current models by their primary contributions, highlighting both methodological advancements and emerging trends.
Research Trends & Future Directions: We identify gaps and opportunities that are ripe for exploration and provide actionable recommendations to guide future research in this rapidly evolving field.
Open Resources: To support ongoing research, we are making key datasets, benchmarks, related papers, and innovations publicly available. Please visit the GitHub repository for this survey; we plan to update it regularly as new papers and advances appear in this domain.
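To make the retrieval, fusion, augmentation, and generation loop concrete, here is a minimal, self-contained Python sketch. The `Document` class, the `embed` stand-in encoder, and the `fuse_and_generate` stub are hypothetical illustrations, not components from the survey or from any particular library; a real system would use a shared multimodal encoder (e.g., a CLIP-style model), a vector index, and an actual (multimodal) LLM call in their place.

```python
# Minimal sketch of a multimodal RAG loop (hypothetical components):
# 1) embed a text query and multimodal documents into a shared space,
# 2) retrieve the top-k documents by cosine similarity,
# 3) fuse the retrieved evidence into an augmented prompt,
# 4) hand the prompt to a generator (stubbed out here).
from dataclasses import dataclass
import numpy as np


@dataclass
class Document:
    content: str           # caption, transcript, or OCR text standing in for the raw modality
    modality: str           # e.g. "text", "image", "audio", "video"
    embedding: np.ndarray   # vector in a shared multimodal embedding space


def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: deterministic per input, purely illustrative.
    A real pipeline would use a multimodal encoder such as a CLIP-style model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)


def retrieve(query_vec: np.ndarray, corpus: list[Document], k: int = 2) -> list[Document]:
    """Rank documents by cosine similarity (embeddings are unit-normalized)."""
    scores = [float(query_vec @ doc.embedding) for doc in corpus]
    order = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in order]


def fuse_and_generate(query: str, retrieved: list[Document]) -> str:
    """Fuse retrieved multimodal evidence into the prompt; the LLM call is stubbed."""
    context = "\n".join(f"[{d.modality}] {d.content}" for d in retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return prompt  # a real pipeline would pass this prompt to a (multimodal) LLM


corpus = [
    Document("A diagram of the RAG pipeline.", "image", embed("rag pipeline diagram")),
    Document("Transcript: the model retrieves evidence before answering.", "audio", embed("retrieval transcript")),
    Document("RAG grounds generation in retrieved external knowledge.", "text", embed("rag definition")),
]
query = "How does RAG reduce hallucinations?"
print(fuse_and_generate(query, retrieve(embed(query), corpus)))
```

Running the script prints the fused prompt that would be handed to a generator; swapping in real encoders, a vector index, and an LLM call turns this sketch into a working pipeline.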
We invite you to read and share our work as we collectively push the boundaries of what AI can achieve. Let’s spark a conversation and drive the next wave of innovation together!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs (2024)
- MRAMG-Bench: A BeyondText Benchmark for Multimodal Retrieval-Augmented Multimodal Generation (2025)
- VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos (2025)
- AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed articles (2024)
- ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models (2025)
- Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey (2025)
- Large Multimodal Models for Low-Resource Languages: A Survey (2025)