Papers
arxiv:2506.22967

ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Published on Jun 28
Authors:
,

Abstract

ActAlign, a zero-shot video classification framework, uses sequence alignment and large language models to achieve high accuracy on fine-grained video classification without video-text supervision.

AI-generated summary

We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8x less parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.22967 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.22967 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.22967 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.