Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
Abstract
PRIS adaptively revises prompts during inference to enhance alignment with user intent in text-to-visual generation, improving accuracy and quality.
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, which is crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across them, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, yielding more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference time. Visualizations are available on the project website: https://subin-kim-cv.github.io/PRIS.
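Conceptually, the abstract describes an inference-time loop: generate several visuals, verify them at the element level, spot recurring failures, and redesign the prompt before regenerating. The sketch below illustrates that loop under stated assumptions; the `generate`, `verify`, and `revise_prompt` callables and the majority-vote rule for recurring failures are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    element: str      # prompt attribute being checked (e.g., "the car is red")
    satisfied: bool   # whether the generated visual matches this element

def pris_loop(prompt, generate, verify, revise_prompt, n_samples=4, n_rounds=3):
    """Alternate between scaling visual samples and redesigning the prompt."""
    best_visual, best_score = None, float("-inf")
    for _ in range(n_rounds):
        # 1) Scale visual generation: draw several samples for the current prompt.
        visuals = [generate(prompt) for _ in range(n_samples)]

        # 2) Verify each sample at the element level: one Feedback per prompt attribute.
        feedback = [verify(prompt, v) for v in visuals]  # list of list[Feedback]
        scores = [sum(f.satisfied for f in fb) / max(len(fb), 1) for fb in feedback]

        # Keep the best-scoring visual seen so far.
        for v, s in zip(visuals, scores):
            if s > best_score:
                best_visual, best_score = v, s

        # 3) Identify failure patterns that recur across the sampled visuals
        #    (here: an element unsatisfied in more than half of the samples).
        n_elements = len(feedback[0]) if feedback else 0
        recurring = []
        for i in range(n_elements):
            failures = sum(1 for fb in feedback if not fb[i].satisfied)
            if failures > len(feedback) // 2:
                recurring.append(feedback[0][i].element)

        if not recurring:
            break  # prompt already well aligned; no redesign needed

        # 4) Redesign the prompt to address the recurring failures, then repeat.
        prompt = revise_prompt(prompt, recurring)

    return best_visual, prompt
```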
Community
Scaling visuals with prompts redesigned for the scaled outputs → break the plateau.
Simply scaling visuals with a fixed prompt quickly hits a performance ceiling: generations repeatedly exhibit the same recurring failure patterns even as compute grows. By redesigning the prompt to directly address these failure patterns at the new visual scale, we break through this plateau, achieving steadily improving generations and much higher prompt adherence for both seen and unseen rewards as compute scales.
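The failure patterns above are made visible by the element-level factual correction verifier: instead of one holistic score, the prompt is decomposed into fine-grained attribute elements and each is checked against the generated visual. The sketch below is a minimal illustration of that idea; `decompose` and `vqa_yes_no` are assumed helpers (e.g., an LLM-based parser and a VQA-style checker), not the paper's exact components.

```python
def element_level_score(prompt, visual, decompose, vqa_yes_no):
    """Score alignment per prompt element rather than with one holistic number."""
    # e.g., "a red car next to two dogs" ->
    #   ["there is a car", "the car is red", "there are two dogs", ...]
    elements = decompose(prompt)

    results = {}
    for element in elements:
        # A yes/no check per element keeps the feedback interpretable:
        # we know exactly which attribute failed, not just that something did.
        results[element] = vqa_yes_no(visual, f"Is the following true of the visual? {element}")

    score = sum(results.values()) / max(len(results), 1)
    return score, results  # overall alignment plus a per-element diagnosis
```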
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling (2025)
- Compositional Image Synthesis with Inference-Time Scaling (2025)
- Improving Text-to-Image Generation with Input-Side Inference-Time Scaling (2025)
- Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation (2025)
- Planning with Sketch-Guided Verification for Physics-Aware Video Generation (2025)
- GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation (2025)
- Personalized Reward Modeling for Text-to-Image Generation (2025)