---
license: mit
---
# Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
# NeurIPS 2025
> [Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation](https://arxiv.org/pdf/2505.14705?).<br>
> [Xin Zhang](https://zhangxin-xd.github.io/), Ziruo Zhang, [Jiawei Du](https://scholar.google.com/citations?user=WrJKEzEAAAAJ&hl=zh-CN), [Zuozhu Liu](https://person.zju.edu.cn/en/lzz), [Joey Tianyi Zhou](https://joeyzhouty.github.io/) <br>
> Agency for Science, Technology and Research (A\*STAR), Singapore <br>
> National University of Singapore, Singapore <br>
> Zhejiang University, China <br>
## π Introduction
<p align="center">
<img src="imgs/problem.png" alt="problem" title="problem" width="700">
</p>
<p align="justify">
<strong> Multimodal embedding distributions across various distillation methods </strong>:
We extract image and text embeddings from a fine-tuned CLIP and project them into a shared representation space using DOSNES.
Red triangles and blue circles denote image and text embeddings, respectively.
Left: Embeddings from randomly sampled data in the original dataset exhibit a well-spread and modality-aligned distribution.
Middle: The distilled dataset generated by a state-of-the-art multimodal dataset distillation (MDD) method, LoRS, exhibits Modality Collapse: image and text embeddings are poorly aligned and concentrated in distinct regions.
Right: Our method effectively mitigates modality collapse, yielding a distribution that better preserves cross-modal alignment and exhibits greater representational diversity.
</p>
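For readers who want to reproduce a similar visualization, here is a minimal sketch. It is not the paper's analysis code: it assumes the Hugging Face `transformers` CLIP API, the stock `openai/clip-vit-base-patch32` weights rather than a fine-tuned checkpoint, and uses t-SNE as a stand-in for DOSNES.

```python
# Minimal sketch: extract paired CLIP image/text embeddings and project them to 2D.
# Assumptions: transformers CLIP API, stock weights, t-SNE instead of DOSNES.
import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["img0.jpg", "img1.jpg"]]  # placeholder paths
captions = ["a dog on the grass", "a red car on the street"]

with torch.no_grad():
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    img_emb = outputs.image_embeds  # (N, 512), L2-normalized projections
    txt_emb = outputs.text_embeds   # (N, 512)

# Project both modalities into a shared 2D space for plotting.
points = TSNE(n_components=2, perplexity=1.0).fit_transform(
    torch.cat([img_emb, txt_emb]).numpy()
)
img_2d, txt_2d = points[: len(images)], points[len(images):]
```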
## Installation
To get started, follow these instructions to set up the environment and install dependencies.
1. **Clone this repository**:
```bash
git clone https://github.com/zhangxin-xd/RepBlend.git
cd RepBlend
```
2. **Install required packages**:
```bash
conda create -n RepBlend python=3.10
conda activate RepBlend
pip install -r requirements.txt
```
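Optionally, you can confirm that PyTorch was installed with GPU support before running any scripts. This quick check is not part of the repository; it only uses standard PyTorch calls.

```python
# Optional sanity check: confirm PyTorch and CUDA are visible in the new environment.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```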
---
## Usage
Here's how to use RepBlend for multimodal dataset distillation:
First, download the pretrained weights and datasets and place them into their respective folders.
### Pretrained Weights
The checkpoints for all experimental networks are available from their respective official repositories. For convenience, we have also provided them together [here](https://huggingface.co/xinxin66/RepBlend).
Once downloaded, put them in `distill_utils/checkpoints/`.
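As an alternative to downloading through the browser, here is a minimal sketch using `huggingface_hub` (assuming it is installed and that mirroring the whole repository into the checkpoints folder is acceptable):

```python
# Minimal sketch: mirror the Hugging Face repo into the checkpoints folder.
# Note: the repo also hosts annotations and trajectories, so you may prefer to
# download only the files you need from the repo page instead.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="xinxin66/RepBlend",
    local_dir="distill_utils/checkpoints",
)
```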
### Experimental Datasets
Our method has been validated on several benchmark datasets, which you can download from the links below. Once downloaded, put them in `distill_utils/data/` (a quick layout check is sketched after the table).
| Dataset | Links |
|-----|-----|
| Flickr30K | [images](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset), [annotations](https://huggingface.co/xinxin66/RepBlend/) |
| COCO | [images](https://cocodataset.org/#download), [annotations](https://huggingface.co/xinxin66/RepBlend) |
| LLaVA-cc3m | [images](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md), [annotations](https://huggingface.co/xinxin66/RepBlend) |
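Before distilling, it can help to confirm the data landed where the scripts expect it. The folder names below are hypothetical placeholders; adjust them to match your actual layout.

```python
# Sanity check: list which (hypothetical) dataset folders exist under distill_utils/data/.
from pathlib import Path

data_root = Path("distill_utils/data")
for name in ["Flickr30k", "COCO", "LLaVA-cc3m"]:  # placeholder folder names
    path = data_root / name
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```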
### Generate Expert Trajectories
You can generate expert trajectories by running `scripts/buffer.sh`, or download our [pre-generated trajectories](https://huggingface.co/xinxin66/RepBlend) for faster reproduction.
```bash
bash scripts/buffer.sh
```
### Distill Multimodal Dataset
You can distill multimodal datasets with RepBlend by running `scripts/distill_coco_repblend.sh` and `scripts/distill_flickr_repblend.sh`.
```bash
bash scripts/distill_coco_repblend.sh
bash scripts/distill_flickr_repblend.sh
```
## Results
Our experiments demonstrate the effectiveness of the proposed approach across various benchmarks.
<div style="display: flex; justify-content: center; align-items: center;">
<img src="imgs/results 1.png" alt="Results 1" width="800"/>
</div>
<br>
<div style="display: flex; justify-content: center; align-items: center;">
<img src="imgs/table 1.png" alt="table 1" width="400"/>
<img src="imgs/table 2.png" alt="table 2" width="400"/>
</div>
For detailed experimental results and further analysis, please refer to the full paper.
---
## Citation
If you find this code useful in your research, please consider citing our work:
```bibtex
@inproceedings{RepBlend2025neurips,
  title={Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation},
  author={Zhang, Xin and Zhang, Ziruo and Du, Jiawei and Liu, Zuozhu and Zhou, Joey Tianyi},
  booktitle={Adv. Neural Inf. Process. Syst. (NeurIPS)},
  year={2025}
}
```
---
## References
Our implementation builds on the following prior works:
- [LoRS: Low-Rank Similarity Mining](https://github.com/silicx/LoRS_Distill)
- [Vision-Language Dataset Distillation](https://github.com/princetonvisualai/multimodal_dataset_distillation)
- [Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory (TESLA)](https://github.com/justincui03/tesla)