Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
Abstract
Nyx, a unified mixed-modal retriever, enhances vision-language generation by retrieving and reasoning over mixed-modal data, outperforming existing RAG systems in real-world scenarios.
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
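To make the retrieval setup concrete, the sketch below shows what mixed-modal-to-mixed-modal dense retrieval could look like: both queries and documents are (text, image) pairs, each encoded into a single normalized vector, and candidates are ranked by cosine similarity. The encoder class, feature dimensions, and toy corpus here are illustrative placeholders, not Nyx's actual architecture.

```python
# Hypothetical sketch of mixed-modal-to-mixed-modal retrieval.
# MixedModalEncoder and its input features are illustrative stand-ins,
# not the paper's actual model or API.
import torch
import torch.nn.functional as F

class MixedModalEncoder(torch.nn.Module):
    """Toy stand-in for a VLM-based encoder that maps a (text, image) pair
    to a single dense vector; a real system would use a pretrained VLM."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_proj = torch.nn.Linear(128, dim)
        self.image_proj = torch.nn.Linear(512, dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Fuse the two modalities by summing their projections, then L2-normalize
        # so that dot products equal cosine similarities.
        fused = self.text_proj(text_feats) + self.image_proj(image_feats)
        return F.normalize(fused, dim=-1)

encoder = MixedModalEncoder()

# Dummy pre-extracted features for one mixed-modal query and a small corpus.
query_emb = encoder(torch.randn(1, 128), torch.randn(1, 512))         # (1, dim)
corpus_emb = encoder(torch.randn(1000, 128), torch.randn(1000, 512))  # (N, dim)

# Dense retrieval: score every document and keep the top-k for the generator.
scores = query_emb @ corpus_emb.T                 # cosine similarities, (1, N)
topk = torch.topk(scores, k=5, dim=-1).indices
print(topk)
```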
Community
We propose Nyx, a unified mixed-modal retriever tailored for URAG scenarios, and construct NyxQA, a large-scale mixed-modal QA dataset. Our framework includes:
- A four-stage automated pipeline for generating and filtering realistic mixed-modal QA pairs.
- A two-stage training framework that combines pre-training on NyxQA with supervised fine-tuning guided by VLM feedback (see the training sketch after this list).
- Strong performance on both text-only RAG benchmarks and vision-language URAG tasks.
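As a rough illustration of the two-stage recipe above, the snippet below pairs a standard in-batch contrastive loss for the pre-training stage with a KL-divergence term that nudges the retriever's score distribution toward VLM-derived utilities for the fine-tuning stage. The paper does not spell out these exact objectives; both loss functions and all names here are assumptions for illustration only.

```python
# Hedged sketch of a two-stage retriever training recipe: InfoNCE pre-training
# followed by alignment with downstream VLM feedback. These objectives are
# common choices, not Nyx's confirmed losses.
import torch
import torch.nn.functional as F

def infonce_loss(q_emb: torch.Tensor, d_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Stage 1: contrastive pre-training with in-batch negatives.
    q_emb, d_emb: (B, dim) L2-normalized query / positive-document embeddings."""
    logits = q_emb @ d_emb.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

def vlm_feedback_loss(retriever_scores: torch.Tensor, vlm_utilities: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Stage 2: align the retriever's ranking with downstream VLM preferences.
    retriever_scores: (B, K) similarities over K candidates per query.
    vlm_utilities:    (B, K) e.g. answer quality when each candidate is used."""
    log_p_retriever = F.log_softmax(retriever_scores / temperature, dim=-1)
    p_vlm = F.softmax(vlm_utilities / temperature, dim=-1)
    return F.kl_div(log_p_retriever, p_vlm, reduction="batchmean")
```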
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- VeritasFi: An Adaptable, Multi-tiered RAG Framework for Multi-modal Financial Question Answering (2025)
- CMRAG: Co-modality-based visual document retrieval and question answering (2025)
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding (2025)
- Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation (2025)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG (2025)
- Generalized Contrastive Learning for Universal Multimodal Retrieval (2025)
- MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval (2025)
Models citing this paper 2
Datasets citing this paper 1
Spaces citing this paper 0