T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
Abstract
We propose T2I-ReasonBench, a benchmark that evaluates the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark a range of T2I generation models and provide a comprehensive analysis of their performance.
Community
This paper introduces T2I-ReasonBench, a novel benchmark designed to probe the limits of reasoning in T2I models. T2I-ReasonBench comprises 800 meticulously designed prompts organized into four dimensions: (1) Idiom Interpretation, (2) Textual Image Design, (3) Entity-Reasoning, and (4) Scientific-Reasoning. These dimensions challenge models to infer latent meaning, integrate domain knowledge, and resolve contextual ambiguities. To quantify performance, we introduce a two-stage evaluation framework: a large language model (LLM) first generates prompt-specific question-criterion pairs that test whether the image includes the essential elements implied by correct reasoning; a multimodal LLM (MLLM) then scores the generated image against these criteria. Experiments across 14 state-of-the-art T2I models reveal that open-source models exhibit critical limitations in reasoning-informed generation, while proprietary models such as GPT-Image-1 demonstrate stronger reasoning and knowledge integration. Our findings underscore the need to improve reasoning capabilities in next-generation T2I systems. This work provides a foundational benchmark and evaluation protocol to guide future research toward reasoning-informed T2I synthesis.
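The two-stage protocol described above can be sketched as a small pipeline. This is a minimal illustration only: the function names and the criterion format are hypothetical, and the LLM/MLLM calls are replaced by stubs — a real harness would query actual models (the paper does not publish this exact interface).

```python
# Hedged sketch of the two-stage evaluation protocol.
# Stage 1: an LLM derives prompt-specific question-criterion pairs.
# Stage 2: an MLLM scores the generated image against each criterion.
# Both model calls are stubbed here for illustration.

def generate_question_criteria(prompt: str) -> list[dict]:
    """Stage 1 (stub): a real implementation would prompt an LLM to
    produce checks for elements that appear only if the T2I model
    reasoned correctly about the input prompt."""
    return [
        {"question": "Does the image depict the prompt's intended meaning?",
         "criterion": "intended (e.g. figurative) meaning is visualized"},
        {"question": "Are the reasoning-implied entities present?",
         "criterion": "all essential entities appear in the image"},
    ]

def score_image(image, qc_pairs: list[dict]) -> list[float]:
    """Stage 2 (stub): a real implementation would show the image and
    each question-criterion pair to an MLLM and parse a score in [0, 1].
    Fixed scores are returned here so the sketch runs end to end."""
    return [1.0 for _ in qc_pairs]

def evaluate(prompt: str, image) -> float:
    """Run both stages and aggregate to a mean criterion score in [0, 1]."""
    qc_pairs = generate_question_criteria(prompt)
    scores = score_image(image, qc_pairs)
    return sum(scores) / len(scores)
```

In a real run, `evaluate` would be called once per benchmark prompt and the per-dimension means reported separately, so that reasoning accuracy and image quality can be analyzed independently.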
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation (2025)
- DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models (2025)
- LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation (2025)
- Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation (2025)
- LVLM-Composer's Explicit Planning for Image Generation (2025)
- Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation (2025)
- Trade-offs in Image Generation: How Do Different Dimensions Interact? (2025)