U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs

This repository contains the official model checkpoints and inference code for U-MARVEL, presented in the paper U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs.

Universal multimodal retrieval (UMR) addresses complex retrieval tasks involving diverse modalities for both queries and candidates. Despite the success of state-of-the-art methods based on multimodal large language models (MLLMs) using contrastive learning principles, the mechanisms underlying their retrieval capabilities remain largely unexplored. This gap potentially leads to suboptimal performance and limited generalization ability.

In this study, we systematically analyze the key factors driving effective embedding learning for UMR using MLLMs. We implement a general MLLM-based embedding learning pipeline and investigate contributors to high-performing universal retrieval systems. Our analysis covers various aspects of embedding generation and training strategies, including progressive transition, hard negative mining, and re-ranker distillation. Our findings reveal that often-overlooked factors can significantly impact model performance.
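
The paper details the full training recipe; as a rough illustration of the contrastive objective with in-batch plus mined hard negatives, here is a minimal PyTorch sketch (the temperature, pooling, and negative-mining strategy below are illustrative assumptions, not the paper's exact settings):

import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(q, p, hard_neg, temperature=0.05):
    """Contrastive loss over in-batch negatives plus mined hard negatives.

    q:        (B, D) query embeddings
    p:        (B, D) positive candidate embeddings (row i matches query i)
    hard_neg: (B, K, D) mined hard-negative embeddings per query
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    # In-batch similarities: every other query's positive acts as a negative.
    sim_in_batch = q @ p.t()                            # (B, B)
    # Similarities to this query's own mined hard negatives.
    sim_hard = torch.einsum("bd,bkd->bk", q, hard_neg)  # (B, K)

    logits = torch.cat([sim_in_batch, sim_hard], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)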

Building on these insights, we introduce U-MARVEL (Universal Multimodal Retrieval via Embedding Learning), a unified framework that outperforms state-of-the-art competitors on the M-BEIR benchmark in supervised settings and demonstrates strong zero-shot performance on tasks such as composed image retrieval and text-to-video retrieval. These results highlight the generalization potential of our framework across various embedding-based retrieval tasks, providing valuable insights for future research.

Model Checkpoints

├── checkpoints
│   ├── hf_models
│   │   └── Qwen2-VL-7B-Instruct
│   └── U-MARVEL-Qwen2VL-7B-Instruct
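
A minimal sketch of loading the released checkpoint with Hugging Face transformers is shown below. It assumes the weights load with the standard Qwen2-VL class and uses the last token's hidden state as the embedding; this pooling choice is an illustrative assumption, so prefer the repository's own inference scripts for the exact prompt template and pooling.

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

ckpt = "checkpoints/U-MARVEL-Qwen2VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(ckpt)

# Encode a text query and take the final token's hidden state as the embedding
# (illustrative pooling; see the repository's inference scripts for details).
inputs = processor(text=["a photo of a dog on the beach"], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
embedding = out.hidden_states[-1][:, -1]                      # (1, hidden_dim)
embedding = torch.nn.functional.normalize(embedding, dim=-1)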

Requirements

To install requirements:

pip install -r requirements.txt

Data Preparation

Download Qwen2-VL-7B-Instruct and place it in ./checkpoints/hf_models/Qwen2-VL-7B-Instruct
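
For example, the backbone can be fetched with huggingface_hub (a minimal sketch; it assumes the Instruct variant is the intended backbone, as the target directory name suggests):

from huggingface_hub import snapshot_download

# Download the Qwen2-VL-7B-Instruct backbone into the expected checkpoint path.
snapshot_download(
    repo_id="Qwen/Qwen2-VL-7B-Instruct",
    local_dir="./checkpoints/hf_models/Qwen2-VL-7B-Instruct",
)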

For the NLI dataset, please refer to this link.

For the multimodal instruction-tuning dataset, please refer to M-BEIR.

After downloading everything, organize the data under ./data as follows:

├── data    
│    ├── M-BEIR
│    ├── nli_for_simcse.csv
│    ├── rerank_data_for_training
│    ├── flickr
│    ├── coco
│    ├── sharegpt4v
│    ├── Urban1K
│    ├── circo
│    ├── genecis
│    ├── vist
│    ├── visdial
│    ├── ccneg
│    ├── sugar-crepe
│    ├── MSVD
│    └── msrvtt
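
A small sanity check that the expected layout is in place (a hedged sketch; the entry names simply mirror the tree above):

from pathlib import Path

# Expected entries under ./data, mirroring the tree above.
EXPECTED = [
    "M-BEIR", "nli_for_simcse.csv", "rerank_data_for_training",
    "flickr", "coco", "sharegpt4v", "Urban1K", "circo", "genecis",
    "vist", "visdial", "ccneg", "sugar-crepe", "MSVD", "msrvtt",
]

data_root = Path("./data")
missing = [name for name in EXPECTED if not (data_root / name).exists()]
if missing:
    print("Missing entries under ./data:", ", ".join(missing))
else:
    print("All expected datasets are in place.")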

Evaluation

To evaluate the model on M-BEIR and on the zero-shot benchmarks, run:

python scripts/vtools_eval_mbeir_model.py  # Evaluate locally  
sh scripts/eval_mbeir_global.sh            # Evaluate globally  
sh scripts/eval_zeroshot.sh                # Evaluate zero-shot
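
In M-BEIR, the local setting retrieves from each task's own candidate pool, while the global setting retrieves from a pool shared across all tasks. As an illustration of the underlying metric only, here is a minimal Recall@k computation over precomputed embeddings (variable names are illustrative and not the scripts' API; it also assumes a single ground-truth candidate per query):

import torch

def recall_at_k(query_emb, cand_emb, gt_index, k=5):
    """Recall@k for precomputed, L2-normalized embeddings.

    query_emb: (Q, D) query embeddings
    cand_emb:  (C, D) candidate-pool embeddings
    gt_index:  (Q,)   index of each query's ground-truth candidate
    """
    sims = query_emb @ cand_emb.t()                    # cosine similarity
    topk = sims.topk(k, dim=1).indices                 # (Q, k)
    hits = (topk == gt_index.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()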

Model Performance

The proposed U-MARVEL framework establishes new state-of-the-art performance across both single-model architectures and recall-then-rerank approaches on the M-BEIR benchmark.

Detailed result tables for the M-BEIR-Local, M-BEIR-Global, and zero-shot settings are reported in the paper.

Acknowledgements

Many thanks to the codebase of LamRA.

Citation

If you use this code for your research or project, please cite:

@article{li2025umarvel,
  title={U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs},
  author={Li, Xiaojie and Li, Chu and Chen, Shi-Zhe and Chen, Xi},
  journal={arXiv preprint arXiv:2507.14902},
  year={2025}
}