Abstract
A new benchmark, IF-VidCap, evaluates video captioning models on instruction-following capabilities, revealing that top-tier open-source models are closing the performance gap with proprietary models.
Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
Community
🎯 First Instruction-Following Video Captioning Benchmark: 1,400 complex, compositional instructions aligned with real-world downstream applications
🔍 Robust Evaluation Protocol: Multi-dimensional evaluation combining rule-based and LLM-based checks (see the sketch after this list)
📊 Comprehensive Analysis: Evaluation of 20+ state-of-the-art models with detailed insights
📚 Training Dataset: Curated dataset for fine-grained instruction-based control
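To make the evaluation protocol concrete, below is a minimal, hypothetical sketch of how rule-based format checks can be paired with an LLM-judge prompt for content correctness. The function names, constraints, and prompt wording are illustrative assumptions, not IF-VidCap's actual implementation.

```python
import re

def check_format(caption: str, max_words: int = 50, required_prefix: str = "Summary:") -> dict:
    """Rule-based checks for formatting constraints stated in the instruction (assumed constraints)."""
    words = caption.split()
    return {
        "within_word_limit": len(words) <= max_words,
        "has_required_prefix": caption.strip().startswith(required_prefix),
        # Count sentence-ending punctuation to approximate a single-sentence constraint.
        "is_single_sentence": len(re.findall(r"[.!?](?:\s|$)", caption.strip())) <= 1,
    }

def build_judge_prompt(instruction: str, caption: str) -> str:
    """Prompt for an LLM judge to assess content correctness (assumed wording)."""
    return (
        "You are grading a video caption against a user instruction.\n"
        f"Instruction: {instruction}\n"
        f"Caption: {caption}\n"
        "Answer with YES or NO: does the caption satisfy every content "
        "requirement of the instruction?"
    )

if __name__ == "__main__":
    instruction = "Describe the main action in one sentence starting with 'Summary:'."
    caption = "Summary: A chef flips a pancake in a sunlit kitchen."
    print(check_format(caption, max_words=20))        # rule-based: format correctness
    print(build_judge_prompt(instruction, caption))   # LLM-based: content correctness
```

A usage note: rule-based checks are cheap and deterministic, so they handle verifiable format constraints, while the LLM judge covers content requirements that cannot be reduced to simple rules.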
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward (2025)
- VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding (2025)
- OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs (2025)
- Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark (2025)
- LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding (2025)
- CapGeo: A Caption-Assisted Approach to Geometric Reasoning (2025)
- MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues (2025)