Abstract
MovieCORE is a video question answering dataset whose deep cognitive questions are generated by multiple large language models acting as thought agents; the paper also introduces an agentic enhancement module (ACE) that improves VQA model performance.
This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.
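For readers curious how an agentic brainstorming loop of this kind might look, here is a minimal sketch. The function, prompts, and the draft-then-refine structure are illustrative assumptions, not the authors' actual pipeline; any LLM endpoint can be substituted for the stub callables.

```python
# Hypothetical sketch of multi-agent QA brainstorming: several LLM "thought
# agents" draft cognition-probing question-answer pairs, then a refiner agent
# iteratively critiques and merges them. Names and prompts are assumptions.
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model endpoint

def brainstorm_qa(scene_description: str, agents: List[LLM],
                  refiner: LLM, rounds: int = 2) -> str:
    # Each agent independently drafts one System-2-style QA pair.
    drafts = [
        agent(
            "Given this movie scene description:\n"
            f"{scene_description}\n"
            "Write one question-answer pair that requires reasoning about "
            "motives, causality, or themes rather than surface recall."
        )
        for agent in agents
    ]
    # The refiner repeatedly critiques and consolidates the candidates.
    candidates = "\n---\n".join(drafts)
    for _ in range(rounds):
        candidates = refiner(
            "Critique these candidate QA pairs and refine them into a single "
            f"deeper, video-specific pair:\n{candidates}"
        )
    return candidates

if __name__ == "__main__":
    # Stub model for demonstration; replace with real LLM calls.
    stub = lambda prompt: "Q: Why does the protagonist hesitate? A: ..."
    print(brainstorm_qa("A heist goes wrong at the docks.", [stub, stub], stub))
```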
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Enhancing Long Video Question Answering with Scene-Localized Frame Grouping (2025)
- Team of One: Cracking Complex Video QA with Model Synergy (2025)
- EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering (2025)
- CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks (2025)
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding (2025)
- MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks (2025)
- HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes (2025)