Abstract
VISTA, a multi-agent system, iteratively refines user prompts to enhance video quality and alignment with user intent, outperforming existing methods.
Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation by iteratively refining prompts. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to a 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
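The abstract outlines VISTA's refinement loop at a high level: plan, generate, select via pairwise tournament, critique, and rewrite. The sketch below restates that loop in Python for clarity; all function names (plan_scenes, generate_videos, tournament_select, critique, rewrite_prompt) and the candidate/iteration counts are hypothetical placeholders, not the paper's actual API.

```python
# Minimal sketch of a VISTA-style prompt-refinement loop, assuming the caller
# supplies the individual agents as callables. Names are illustrative only.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    visual: str   # feedback from the visual-fidelity agent
    audio: str    # feedback from the audio-fidelity agent
    context: str  # feedback from the contextual-fidelity agent


def refine_prompt(
    user_idea: str,
    plan_scenes: Callable[[str], str],
    generate_videos: Callable[[str, int], List[object]],
    tournament_select: Callable[[List[object]], object],
    critique: Callable[[object, str], Critique],
    rewrite_prompt: Callable[[str, Critique], str],
    num_candidates: int = 4,
    num_iterations: int = 3,
) -> Tuple[object, str]:
    """Iteratively refine a generation prompt using multi-agent feedback."""
    prompt = plan_scenes(user_idea)  # decompose the idea into a temporal plan
    best_video = None
    for _ in range(num_iterations):
        candidates = generate_videos(prompt, num_candidates)
        best_video = tournament_select(candidates)   # pairwise tournament winner
        feedback = critique(best_video, user_idea)   # visual/audio/context critics
        prompt = rewrite_prompt(prompt, feedback)    # reasoning agent rewrites prompt
    return best_video, prompt
```

One plausible reading of the design is that pairwise tournament selection sidesteps the need for a calibrated absolute video-quality score, which is difficult to obtain reliably at test time.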
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- VideoScore2: Think before You Score in Generative Video Evaluation (2025)
- Maestro: Self-Improving Text-to-Image Generation via Agent Orchestration (2025)
- No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation (2025)
- TTOM: Test-Time Optimization and Memorization for Compositional Video Generation (2025)
- DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis (2025)
- VChain: Chain-of-Visual-Thought for Reasoning in Video Generation (2025)
- Code2Video: A Code-centric Paradigm for Educational Video Generation (2025)