Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
Abstract
Diffusion-Link, a diffusion-based modality-bridging module, reduces the audio-text modality gap and enhances multimodal encoder-LLM coupling, achieving state-of-the-art performance in automatic audio captioning.
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art results on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains of up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and that diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance: https://github.com/DevKiHyun/Diffusion-Link
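The architecture described in the abstract lends itself to a compact implementation. Below is a minimal PyTorch sketch of a denoising network built from three residual MLP blocks that maps noisy audio embeddings toward the text-embedding distribution; the hidden width, timestep conditioning, and class names (`ResidualMLPBlock`, `DiffusionLinkSketch`) are illustrative assumptions, not the authors' released code.

```python
# Sketch of a lightweight diffusion-based modality-bridging network with
# three residual MLP blocks. Hidden width and timestep conditioning are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class ResidualMLPBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual connection


class DiffusionLinkSketch(nn.Module):
    """Denoiser that maps noisy audio embeddings from a frozen multimodal
    encoder toward the text-embedding distribution, conditioned on the
    diffusion timestep."""

    def __init__(self, embed_dim: int = 512, hidden: int = 1024, num_blocks: int = 3):
        super().__init__()
        # Simple learned timestep embedding (an assumption for this sketch).
        self.time_mlp = nn.Sequential(
            nn.Linear(1, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.blocks = nn.ModuleList(
            [ResidualMLPBlock(embed_dim, hidden) for _ in range(num_blocks)]
        )
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, noisy_embed: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # noisy_embed: (batch, embed_dim) frozen-encoder audio embedding plus noise
        # t: (batch,) diffusion timesteps in [0, 1]
        h = noisy_embed + self.time_mlp(t.unsqueeze(-1))
        for block in self.blocks:
            h = block(h)
        return self.out(h)  # prediction of the clean (text-like) embedding


# Usage sketch: one denoising prediction for a batch of audio embeddings.
model = DiffusionLinkSketch()
audio_embeds = torch.randn(4, 512)
t = torch.rand(4)
pred = model(audio_embeds, t)
```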
Community
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- WAVE: Learning Unified&Versatile Audio-Visual Embeddings with Multimodal LLM (2025)
- JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation (2025)
- SPANER: Shared Prompt Aligner for Multimodal Semantic Representation (2025)
- Training-Free Multimodal Guidance for Video to Audio Generation (2025)
- Spatial-CLAP: Learning Spatially-Aware Audio-Text Embeddings for Multi-Source Conditions (2025)
- RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing (2025)
- MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis (2025)