Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks
Abstract
A benchmark built on a synthetic, annotated scatterplot dataset evaluates AI models on counting and localizing clusters and outliers; the models count reliably but localize poorly.
AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, limiting insight into model performance. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots from six data generators and 17 chart designs, and a benchmark based on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, outliers (90%+ Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Chart design appears to have only a secondary impact on performance, though it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or randomly colored points. Supplementary materials are available at https://github.com/feedzai/biy-paper.
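To make the evaluation setup concrete, below is a minimal sketch of N-shot prompting for the cluster-counting task using the OpenAI Python SDK. The prompt wording, the `gpt-4o` model name, and the `shots` format are illustrative assumptions, not the paper's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def as_image_part(path: str) -> dict:
    """Encode a chart PNG as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def count_clusters(chart: str, shots: list[tuple[str, int]], model: str = "gpt-4o") -> str:
    """N-shot cluster counting: each shot is an (image_path, true_count) pair."""
    messages = [{"role": "system",
                 "content": "Count the clusters in the scatterplot. Answer with a single integer."}]
    for shot_path, true_count in shots:
        # Worked examples first: an image, then the expected answer.
        messages.append({"role": "user", "content": [as_image_part(shot_path)]})
        messages.append({"role": "assistant", "content": str(true_count)})
    # Finally, the chart to be evaluated.
    messages.append({"role": "user", "content": [as_image_part(chart)]})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

# e.g. count_clusters("chart_0042.png", shots=[("example_1.png", 3), ("example_2.png", 5)])
```

Gemini models could be queried analogously through Google's SDK, with the same shot structure.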
Community
Contributions
- A synthetic, annotated dataset (and its generation pipeline) for scatterplot-related tasks; a toy sketch of such a pipeline follows this list.
- A comprehensive evaluation of the performance of ten proprietary models on said tasks.
- A list of considerations when designing charts and providing them as input to AI models.
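As a rough illustration of what such a generation pipeline involves (not the authors' actual code), the sketch below renders one annotated scatterplot with scikit-learn and Matplotlib, storing cluster centers, cluster bounding boxes, and outlier coordinates next to the image:

```python
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

# Clustered points plus a handful of uniform-noise outliers.
points, labels = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
outliers = rng.uniform(points.min(), points.max(), size=(10, 2))

# Derive the annotations from the generated data.
clusters = []
for k in np.unique(labels):
    pts = points[labels == k]
    clusters.append({
        "center": pts.mean(axis=0).tolist(),
        "bbox": pts.min(axis=0).tolist() + pts.max(axis=0).tolist(),  # [xmin, ymin, xmax, ymax]
    })
annotation = {"clusters": clusters, "outliers": outliers.tolist()}

# Render one chart design and save the image alongside its annotation.
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(points[:, 0], points[:, 1], s=8)
ax.scatter(outliers[:, 0], outliers[:, 1], s=8)
fig.savefig("scatter_0000.png", dpi=150)

with open("scatter_0000.json", "w") as f:
    json.dump(annotation, f, indent=2)
```

A full pipeline would sweep the generator parameters and chart-design options (aspect ratio, colors, marker size) in a loop to produce thousands of such image/annotation pairs.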
Quickstart
- Paper: https://arxiv.org/abs/2510.06071
- Dataset and benchmark: https://github.com/feedzai/biy-paper
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information (2025)
- Are LLMs ready to help non-expert users to make charts of official statistics data? (2025)
- Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions (2025)
- Meet Your New Client: Writing Reports for AI - Benchmarking Information Loss in Market Research Deliverables (2025)
- Can we Evaluate RAGs with Synthetic Data? (2025)
- Is this chart lying to me? Automating the detection of misleading visualizations (2025)
- Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs (2025)