---
license: mit
---

# 🌟 Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
# NeurIPS 2025 
> [Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation](https://arxiv.org/pdf/2505.14705?).<br>
> [Xin Zhang](https://zhangxin-xd.github.io/), Ziruo Zhang, [Jiawei Du](https://scholar.google.com/citations?user=WrJKEzEAAAAJ&hl=zh-CN), [Zuozhu Liu](https://person.zju.edu.cn/en/lzz), [Joey Tianyi Zhou](https://joeyzhouty.github.io/) <br>
> Agency for Science, Technology and Research (A*STAR), Singapore <br>
> National University of Singapore, Singapore <br>
> Zhejiang University, China <br>
## 📖 Introduction
<p align="center">
  <img src="imgs/problem.png" alt="problem" title="problem" width="700">
</p>

<p align="justify">
  <strong> Multimodal embedding distributions across various distillation methods </strong>:
  We extract image and text embeddings from a finetuned CLIP and project them into a shared representation space using DOSNES. 
  Red triangles and blue circles denote image and text embeddings, respectively. 
  Left: Embeddings from randomly sampled data in the original dataset exhibit a well-spread and modality-aligned distribution. 
  Middle: The distilled dataset generated by a state-of-the-art multimodal dataset distillation (MDD) method (LoRS) exhibits modality collapse, where image and text embeddings are poorly aligned and concentrated in distinct regions. 
  Right: Our method effectively mitigates modality collapse, yielding a distribution that better preserves cross-modal alignment and exhibits greater representational diversity.
</p>
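
The visualization above can be reproduced in spirit with a few lines of Python. The sketch below is illustrative only: it uses the Hugging Face `transformers` CLIP and scikit-learn's t-SNE as a stand-in for DOSNES, and the image paths and captions are hypothetical placeholders rather than files shipped with this repository.

```python
# Illustrative sketch: extract CLIP image/text embeddings and project them to 2-D.
# t-SNE is used here as a stand-in for DOSNES; paths and captions are placeholders.
import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p).convert("RGB") for p in ["img0.jpg", "img1.jpg"]]  # hypothetical paths
captions = ["a dog running on the grass", "a red car parked on the street"]

with torch.no_grad():
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# L2-normalize both modalities onto the same unit sphere, then project them jointly.
emb = torch.nn.functional.normalize(torch.cat([img_emb, txt_emb], dim=0), dim=-1)
proj = TSNE(n_components=2, perplexity=2, init="pca").fit_transform(emb.numpy())
print(proj.shape)  # (num_images + num_captions, 2)
```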

## βš™οΈ Installation

To get started, follow these instructions to set up the environment and install dependencies.

1. **Clone this repository**:
    ```bash
    git clone https://github.com/zhangxin-xd/RepBlend.git
    cd RepBlend
    ```

2. **Install required packages** (a quick sanity check follows this list):
    ```bash
    conda create -n RepBlend python=3.10
    conda activate RepBlend
    pip install -r requirements.txt
    ```
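
The environment check below is illustrative and not part of the repository; it simply confirms that PyTorch can see a GPU before you launch the longer buffer and distillation runs.

```python
# Optional sanity check (illustrative): verify the installed PyTorch build and GPU visibility.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```
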
---

## 🚀 Usage

Here's how to use RepBlend for Multimodal Dataset Distillation:

First, download the pretrained weights and datasets and place them into their respective folders.
### Pretrained Weights
The checkpoints for all experimental networks are available from their respective official repositories. For convenience, we have also provided them together [🤗 here](https://huggingface.co/xinxin66/RepBlend).
Once downloaded, put them in `distill_utils/checkpoints/`.

### Experimental Datasets
Our method has been validated on several benchmark datasets, which you can download from the links below. Once downloaded, put them in `distill_utils/data/`.
| Dataset | Links |
|-----|-----|
| Flickr30K | [images](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset), [🤗 annotations](https://huggingface.co/xinxin66/RepBlend/) |
| COCO | [images](https://cocodataset.org/#download), [🤗 annotations](https://huggingface.co/xinxin66/RepBlend) |
| LLaVA-cc3m | [images](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md), [🤗 annotations](https://huggingface.co/xinxin66/RepBlend) |

### Generate Expert Trajectories
You can generate expert trajectories by running `scripts/buffer.sh`, or download our 🤗 [pre-generated trajectories](https://huggingface.co/xinxin66/RepBlend) for faster reproduction.
```bash
bash scripts/buffer.sh
```
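
For intuition about what this step produces: an expert trajectory is a sequence of teacher-model parameter snapshots recorded while training on the full dataset. The sketch below is a simplified, hypothetical stand-in for that idea, not the code in `scripts/buffer.sh`; the function name and the saved path are illustrative.

```python
# Conceptual sketch of recording an "expert trajectory": periodic snapshots of a
# teacher model's parameters during normal training on the full dataset.
def record_expert_trajectory(model, loader, optimizer, loss_fn, epochs, snapshot_every=1):
    trajectory = []  # list of state_dict snapshots consumed later by trajectory matching
    for epoch in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
        if epoch % snapshot_every == 0:
            # Keep detached CPU copies so continued training does not overwrite them.
            trajectory.append({k: v.detach().cpu().clone() for k, v in model.state_dict().items()})
    return trajectory

# Several trajectories (e.g. from different seeds) are typically saved to a buffer directory,
# e.g. torch.save(trajectory, "buffers/expert_0.pt")  # hypothetical path
```
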
### Distill Multimodal Dataset
You can distill multimodal datasets with RepBlend by running `scripts/distill_coco_repblend.sh` and `scripts/distill_flickr_repblend.sh`.
```bash
bash scripts/distill_coco_repblend.sh
bash scripts/distill_flickr_repblend.sh
```

## 📊 Results

Our experiments demonstrate the effectiveness of the proposed approach across various benchmarks. 
<div style="display: flex; justify-content: center; align-items: center;">
    <img src="imgs/results 1.png" alt="Results 1" width="800"/>
</div>
<br>
<div style="display: flex; justify-content: center; align-items: center;">
    <img src="imgs/table 1.png" alt="table 1" width="400"/>
    <img src="imgs/table 2.png" alt="table 2" width="400"/>
</div>

For detailed experimental results and further analysis, please refer to the full paper.

---

## 📑 Citation

If you find this code useful in your research, please consider citing our work:

```bibtex
@inproceedings{RepBlend2025neurips,
    title={Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation},
    author={Zhang, Xin and Zhang, Ziruo and Du, Jiawei and Liu, Zuozhu and Zhou, Joey Tianyi},
    booktitle={Adv. Neural Inf. Process. Syst. (NeurIPS)},
    year={2025}
}
```
---
## 🎉 Reference
Our code builds on the following prior works:
- [LoRS: Low-Rank Similarity Mining](https://github.com/silicx/LoRS_Distill)
- [Vision-Language Dataset Distillation](https://github.com/princetonvisualai/multimodal_dataset_distillation)
- [Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory (TESLA)](https://github.com/justincui03/tesla)