Abstract
How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions across the two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, with each stage utilizing specialized foundation models. To encourage future research on this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.
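To make the three-stage workflow concrete, here is a minimal, hypothetical Python sketch. The function names (propose_differences, localize_keyframes, compare_frames) and data formats are illustrative placeholders only, not the released implementation (see the code link in the Community section for that).

```python
# Hypothetical sketch of the three-stage VidDiff workflow from the abstract.
# All names and return formats are placeholders, not the released code.
from typing import Dict, List, Tuple

Frame = str  # stand-in for an image/frame object


def propose_differences(action: str) -> List[str]:
    """Stage 1 (action difference proposal): per the abstract, a foundation
    model proposes candidate fine-grained differences from the action
    description; a fixed list stands in for it here."""
    return [f"{action}: candidate difference {i}" for i in range(3)]


def localize_keyframes(video: List[Frame], diffs: List[str]) -> Dict[str, int]:
    """Stage 2 (keyframe localization): map each candidate difference to the
    frame index where it is best observed (a real system would use a
    video-language localization model)."""
    return {d: 0 for d in diffs}


def compare_frames(frame_a: Frame, frame_b: Frame, diff: str) -> str:
    """Stage 3 (frame differencing): a foundation model would compare the two
    localized frames and decide which video exhibits the difference."""
    return "A"  # placeholder verdict


def vid_diff(video_a: List[Frame], video_b: List[Frame], action: str) -> List[Tuple[str, str]]:
    diffs = propose_differences(action)
    loc_a = localize_keyframes(video_a, diffs)
    loc_b = localize_keyframes(video_b, diffs)
    return [
        (d, compare_frames(video_a[loc_a[d]], video_b[loc_b[d]], d))
        for d in diffs
    ]


if __name__ == "__main__":
    print(vid_diff(["frame_0"], ["frame_0"], "golf swing"))
```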
Community
ICLR 2025
- X / tweet thread: https://x.com/jmhb0/status/1899856949191262454
- Project page / blog: https://jmhb0.github.io/viddiff/
- Benchmark: https://huggingface.co/datasets/jmhb/VidDiffBench (loading sketch below)
- Code: https://github.com/jmhb0/viddiff
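For a quick look at the benchmark, here is a minimal loading sketch using the Hugging Face `datasets` library. It assumes the default configuration of the repo id above loads without extra arguments; split and column names are discovered at runtime rather than assumed.

```python
# Minimal sketch: load VidDiffBench from the Hugging Face Hub and inspect it.
# Assumes the default configuration loads without extra arguments; split and
# column names are printed rather than assumed.
from datasets import load_dataset

ds = load_dataset("jmhb/VidDiffBench")  # repo id taken from the benchmark URL above

for split_name, split in ds.items():
    print(split_name, len(split), split.column_names)
```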
Librarian Bot (automated): similar papers recommended by the Semantic Scholar API
- Generative Frame Sampler for Long Video Understanding (2025)
- Towards Fine-Grained Video Question Answering (2025)
- Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! (2025)
- VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation (2025)
- MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos (2025)
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding (2025)
- HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models (2025)