arxiv:2508.01711

GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval

Published on Aug 3

Authors:

Abstract

GAID framework enhances text-to-video retrieval by adaptively fusing audio and visual features and perturbing text embeddings, leading to state-of-the-art results.

AI-generated summary

Text-to-video retrieval requires precise alignment between language and temporally rich video signals. Existing methods predominantly exploit visual cues and often overlook complementary audio semantics or adopt coarse fusion strategies, leading to suboptimal multimodal representations. We present GAID, a framework that jointly address this gap via two key components: (i) a Frame-level Gated Fusion (FGF) that adaptively integrates audio and visual features under textual guidance, enabling fine-grained temporal alignment; and (ii) a Directional Adaptive Semantic Perturbation (DASP) that injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. These modules complement each other -- fusion reduces modality gaps while perturbation regularizes cross-modal matching -- yielding more stable and expressive representations. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results across all retrieval metrics with notable efficiency gains. Our code is available at https://github.com/YangBowenn/GAID.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.01711 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2508.01711 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.01711 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.