Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
NeurIPS 2025
Xin Zhang, Ziruo Zhang, Jiawei Du, Zuozhu Liu, Joey Tianyi Zhou
Agency for Science, Technology and Research (A*STAR), Singapore
National University of Singapore, Singapore
Zhejiang University, China
📖 Introduction
Multimodal embedding distributions across distillation methods: we extract image and text embeddings from a fine-tuned CLIP and project them into a shared representation space using DOSNES. Red triangles and blue circles denote image and text embeddings, respectively. Left: Embeddings of data randomly sampled from the original dataset are well spread and modality-aligned. Middle: The distilled dataset generated by a state-of-the-art MDD method (LoRS) suffers from Modality Collapse, where image and text embeddings are poorly aligned and concentrated in distinct regions. Right: Our method effectively mitigates modality collapse, yielding a distribution that better preserves cross-modal alignment and exhibits greater representational diversity.
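A visualization of this kind can be reproduced with a short script. The sketch below is a minimal approximation, not the paper's exact pipeline: it assumes the off-the-shelf Hugging Face `openai/clip-vit-base-patch32` checkpoint in place of the fine-tuned CLIP, and substitutes scikit-learn's t-SNE for DOSNES, which has no standard library implementation.

```python
# Minimal sketch of the embedding visualization (t-SNE stands in for DOSNES;
# the CLIP checkpoint and the dummy data below are assumptions for illustration).
import torch
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

# Placeholder paired data; in practice, load real image-caption pairs.
images = [Image.new("RGB", (224, 224), color=(i * 16, 0, 0)) for i in range(16)]
captions = [f"a photo, sample number {i}" for i in range(16)]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# L2-normalize, then project both modalities into one shared 2-D space.
emb = torch.nn.functional.normalize(torch.cat([img_emb, txt_emb]), dim=-1).numpy()
proj = TSNE(n_components=2, metric="cosine", init="random",
            perplexity=5).fit_transform(emb)

n = len(images)
plt.scatter(proj[:n, 0], proj[:n, 1], marker="^", c="red", label="image")
plt.scatter(proj[n:, 0], proj[n:, 1], marker="o", c="blue", label="text")
plt.legend()
plt.show()
```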
⚙️ Installation
To get started, follow these instructions to set up the environment and install dependencies.
Clone this repository:
```bash
git clone https://github.com/zhangxin-xd/RepBlend.git
cd RepBlend
```
Install required packages:
```bash
conda create -n RepBlend python=3.10
conda activate RepBlend
pip install -r requirements.txt
```
🚀 Usage
Here's how to use RepBlend for multimodal dataset distillation:
First, download the pretrained weights and datasets and place them into their respective folders.
Pretrained Weights
The checkpoints for all experimental networks are available from their respective official repositories. For convenience, we have also provided them together 🤗 here. Once downloaded, place them in `distill_utils/checkpoints/`.
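If you want to verify the downloads before training, a quick sanity check like the following can help. This is a minimal sketch; the file pattern under `distill_utils/checkpoints/` is an assumption, so adjust it to the files you actually placed there.

```python
# Sanity-check downloaded checkpoints (file pattern is an assumption).
import glob
import torch

for path in glob.glob("distill_utils/checkpoints/*.pt*"):
    state = torch.load(path, map_location="cpu")
    # Checkpoints are typically a state_dict, or a dict wrapping one.
    keys = list(state.keys()) if isinstance(state, dict) else []
    print(f"{path}: {len(keys)} top-level entries")
```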
Experimental Datasets
Our method has been validated on various benchmarks; the datasets can be downloaded from the links below. Once downloaded, place them in `distill_utils/data/`.
| Datasets | Links |
| --- | --- |
| Flickr30K | images, 🤗 annotations |
| COCO | images, 🤗 annotations |
| LLaVA-cc3m | images, 🤗 annotations |
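Once the files are in place, a quick check that a download is complete can look like the sketch below. The paths and the JSON annotation layout are assumptions, not guaranteed by the repo, so adjust them to the files you actually downloaded.

```python
# Quick completeness check for a downloaded dataset
# (paths and JSON format are hypothetical; adjust to your layout).
import json
import os

root = "distill_utils/data/Flickr30K"
with open(os.path.join(root, "annotations.json")) as f:
    annotations = json.load(f)
print(f"{len(annotations)} annotation records")
print(f"{len(os.listdir(os.path.join(root, 'images')))} image files")
```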
Generate Expert Trajectories
You can generate expert trajectories by running `scripts/buffer.sh`, or download our 🤗 [pre-generated trajectories](https://huggingface.co/xinxin66/RepBlend) for faster reproduction.

```bash
bash scripts/buffer.sh
```
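To confirm a buffer was written (or downloaded) correctly, you can inspect it with plain PyTorch. This is a minimal sketch: the path and the assumption that each trajectory is a list of per-epoch parameter snapshots follow the common MTT-style convention, not a guarantee of this repo's exact format.

```python
# Inspect an expert-trajectory buffer (path and layout are assumptions:
# MTT-style buffers are usually lists of per-epoch parameter snapshots).
import torch

buffer = torch.load("buffers/replay_buffer_0.pt", map_location="cpu")
print(f"loaded {len(buffer)} expert trajectories")
print(f"first trajectory holds {len(buffer[0])} parameter snapshots")
```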
Distill Multimodal Dataset
You can distill multimodal datasets with RepBlend by running `scripts/distill_coco_repblend.sh` and `scripts/distill_flickr_repblend.sh`.

```bash
bash scripts/distill_coco_repblend.sh
bash scripts/distill_flickr_repblend.sh
```
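After distillation finishes, the synthetic set can be reloaded for evaluation. The sketch below assumes the distilled images and text features are saved as tensors with `torch.save`; the file names are hypothetical, so check the save paths logged by the distillation script.

```python
# Reload a distilled multimodal set (file names are hypothetical;
# use the paths actually logged by the distillation script).
import torch

distilled_images = torch.load("logged_files/images_best.pt", map_location="cpu")
distilled_texts = torch.load("logged_files/text_best.pt", map_location="cpu")
print(distilled_images.shape, distilled_texts.shape)
```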
📊 Results
Our experiments demonstrate the effectiveness of the proposed approach across various benchmarks.
For detailed experimental results and further analysis, please refer to the full paper.
📝 Citation
If you find this code useful in your research, please consider citing our work:
```bibtex
@inproceedings{RepBlend2025neurips,
  title={Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation},
  author={Zhang, Xin and Zhang, Ziruo and Du, Jiawei and Liu, Zuozhu and Zhou, Joey Tianyi},
  booktitle={Adv. Neural Inf. Process. Syst. (NeurIPS)},
  year={2025}
}
```
🔗 Reference
Our code builds upon the following previous works: