T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
Abstract
We propose T2I-ReasonBench, a benchmark that evaluates the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark a range of T2I generation models and provide a comprehensive analysis of their performance.
Community
This paper introduces T2I-ReasonBench, a novel benchmark designed to probe the limits of reasoning in T2I models. T2I-ReasonBench comprises 800 meticulously designed prompts organized into four dimensions: (1) Idiom Interpretation, (2) Textual Image Design, (3) Entity-Reasoning, and (4) Scientific-Reasoning. These dimensions challenge models to infer latent meaning, integrate domain knowledge, and resolve contextual ambiguities. To quantify performance, we introduce a two-stage evaluation framework: a large language model (LLM) first generates prompt-specific question-criterion pairs that test whether the image includes the essential elements implied by correct reasoning; a multimodal LLM (MLLM) then scores the generated image against these criteria. Experiments across 14 state-of-the-art T2I models reveal that open-source models exhibit critical limitations in reasoning-informed generation, while proprietary models such as GPT-Image-1 demonstrate stronger reasoning and knowledge integration. Our findings underscore the need to improve reasoning capabilities in next-generation T2I systems. This work provides a foundational benchmark and evaluation protocol to guide future research toward reasoning-informed T2I synthesis.
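The two-stage protocol described above can be sketched as a small pipeline. This is a minimal illustration only: the function names and the criterion format are hypothetical, and the LLM/MLLM calls are replaced by stubs — a real harness would query actual models (the paper does not publish this exact interface).

```python
# Hedged sketch of the two-stage evaluation protocol.
# Stage 1: an LLM derives prompt-specific question-criterion pairs.
# Stage 2: an MLLM scores the generated image against each criterion.
# Both model calls are stubbed here for illustration.

def generate_question_criteria(prompt: str) -> list[dict]:
    """Stage 1 (stub): a real implementation would prompt an LLM to
    produce checks for elements that appear only if the T2I model
    reasoned correctly about the input prompt."""
    return [
        {"question": "Does the image depict the prompt's intended meaning?",
         "criterion": "intended (e.g. figurative) meaning is visualized"},
        {"question": "Are the reasoning-implied entities present?",
         "criterion": "all essential entities appear in the image"},
    ]

def score_image(image, qc_pairs: list[dict]) -> list[float]:
    """Stage 2 (stub): a real implementation would show the image and
    each question-criterion pair to an MLLM and parse a score in [0, 1].
    Fixed scores are returned here so the sketch runs end to end."""
    return [1.0 for _ in qc_pairs]

def evaluate(prompt: str, image) -> float:
    """Run both stages and aggregate to a mean criterion score in [0, 1]."""
    qc_pairs = generate_question_criteria(prompt)
    scores = score_image(image, qc_pairs)
    return sum(scores) / len(scores)
```

In a real run, `evaluate` would be called once per benchmark prompt and the per-dimension means reported separately, so that reasoning accuracy and image quality can be analyzed independently.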
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation (2025)
- DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models (2025)
- LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation (2025)
- Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation (2025)
- LVLM-Composer's Explicit Planning for Image Generation (2025)
- Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation (2025)
- Trade-offs in Image Generation: How Do Different Dimensions Interact? (2025)