arXiv:2505.06280

Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation

Published on May 6, 2025
Abstract

AI-generated summary: The SoT benchmark evaluates textual and visual prompts for semantic segmentation, revealing strengths and weaknesses across different domains.

Prompt engineering has shown remarkable success with large language models, yet its systematic exploration in computer vision remains limited. In semantic segmentation, both textual and visual prompts offer distinct advantages: textual prompts through open-vocabulary methods allow segmentation of arbitrary categories, while visual reference prompts provide intuitive reference examples. However, existing benchmarks evaluate these modalities in isolation, without direct comparison under identical conditions. We present Show or Tell (SoT), a novel benchmark specifically designed to evaluate both visual and textual prompts for semantic segmentation across 14 datasets spanning 7 diverse domains (common scenes, urban, food, waste, parts, tools, and land-cover). We evaluate 5 open-vocabulary methods and 4 visual reference prompt approaches, adapting the latter to handle multi-class segmentation through a confidence-based mask merging strategy. Our extensive experiments reveal that open-vocabulary methods excel with common concepts easily described by text but struggle with complex domains like tools, while visual reference prompt methods achieve good average results but exhibit high variability depending on the input prompt. Through comprehensive quantitative and qualitative analysis, we identify the strengths and weaknesses of both prompting modalities, providing valuable insights to guide future research in vision foundation models for segmentation tasks.
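The abstract mentions adapting visual reference prompt methods to multi-class segmentation via a confidence-based mask merging strategy, but the page does not spell out the exact rule. Below is a minimal sketch of one plausible realization, assuming each class is segmented independently and yields a per-pixel confidence map; the function name, background label, and threshold value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def merge_class_masks(confidence_maps, background_threshold=0.5):
    """Merge per-class confidence maps into one multi-class segmentation.

    confidence_maps: dict mapping class_id -> (H, W) float array of per-pixel
        confidence scores, e.g. from running a visual reference prompt method
        once per class. (Hypothetical interface; not the paper's exact API.)
    background_threshold: pixels whose best class confidence falls below this
        value are assigned the background label 0 (illustrative choice).
    """
    class_ids = sorted(confidence_maps)
    # Stack into (C, H, W) so we can take a per-pixel argmax across classes.
    stacked = np.stack([confidence_maps[c] for c in class_ids], axis=0)

    best_idx = stacked.argmax(axis=0)   # index of the most confident class per pixel
    best_conf = stacked.max(axis=0)     # its confidence value

    # Map argmax indices back to class ids; low-confidence pixels become background.
    id_lookup = np.array(class_ids)
    segmentation = id_lookup[best_idx]
    segmentation[best_conf < background_threshold] = 0
    return segmentation


# Example: two classes with overlapping predictions on a 2x2 image.
maps = {
    1: np.array([[0.9, 0.2], [0.1, 0.6]]),
    2: np.array([[0.3, 0.8], [0.2, 0.4]]),
}
print(merge_class_masks(maps))  # [[1 2]
                                #  [0 1]]
```

In this sketch each pixel takes the label of its most confident class, and pixels whose best score falls below the threshold fall back to background; the actual strategy in SoT may calibrate or weight the per-class confidences differently.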
