Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks
Abstract
A benchmark built on a synthetic, annotated scatterplot dataset evaluates AI models on counting and localizing clusters and outliers; the models count reliably but localize poorly.
AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, limiting insight into model performance. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots from six data generators and 17 chart designs, and a benchmark based on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, outliers (90%+ Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Chart design appears to have only a secondary impact on performance, though it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or randomly colored points. Supplementary materials are available at https://github.com/feedzai/biy-paper.
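To make the evaluation setup concrete, below is a minimal sketch of N-shot prompting for the cluster-counting task using the OpenAI Python SDK. The prompt wording, the `gpt-4o` model name, and the `shots` format are illustrative assumptions, not the paper's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def as_image_part(path: str) -> dict:
    """Encode a chart PNG as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def count_clusters(chart: str, shots: list[tuple[str, int]], model: str = "gpt-4o") -> str:
    """N-shot cluster counting: each shot is an (image_path, true_count) pair."""
    messages = [{"role": "system",
                 "content": "Count the clusters in the scatterplot. Answer with a single integer."}]
    for shot_path, true_count in shots:
        # Worked examples first: an image, then the expected answer.
        messages.append({"role": "user", "content": [as_image_part(shot_path)]})
        messages.append({"role": "assistant", "content": str(true_count)})
    # Finally, the chart to be evaluated.
    messages.append({"role": "user", "content": [as_image_part(chart)]})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

# e.g. count_clusters("chart_0042.png", shots=[("example_1.png", 3), ("example_2.png", 5)])
```

Gemini models could be queried analogously through Google's SDK, with the same shot structure.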
Community
Contributions
- A synthetic, annotated dataset (and its generation pipeline) for scatterplot-related tasks; a toy sketch of such a pipeline follows this list.
- A comprehensive evaluation of the performance of ten proprietary models on said tasks.
- A list of considerations when designing charts and providing them as input to AI models.
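As a rough illustration of what such a generation pipeline involves (not the authors' actual code), the sketch below renders one annotated scatterplot with scikit-learn and Matplotlib, storing cluster centers, cluster bounding boxes, and outlier coordinates next to the image:

```python
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

# Clustered points plus a handful of uniform-noise outliers.
points, labels = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
outliers = rng.uniform(points.min(), points.max(), size=(10, 2))

# Derive the annotations from the generated data.
clusters = []
for k in np.unique(labels):
    pts = points[labels == k]
    clusters.append({
        "center": pts.mean(axis=0).tolist(),
        "bbox": pts.min(axis=0).tolist() + pts.max(axis=0).tolist(),  # [xmin, ymin, xmax, ymax]
    })
annotation = {"clusters": clusters, "outliers": outliers.tolist()}

# Render one chart design and save the image alongside its annotation.
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(points[:, 0], points[:, 1], s=8)
ax.scatter(outliers[:, 0], outliers[:, 1], s=8)
fig.savefig("scatter_0000.png", dpi=150)

with open("scatter_0000.json", "w") as f:
    json.dump(annotation, f, indent=2)
```

A full pipeline would sweep the generator parameters and chart-design options (aspect ratio, colors, marker size) in a loop to produce thousands of such image/annotation pairs.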
Quickstart
- Paper: https://arxiv.org/abs/2510.06071
- Dataset and benchmark: https://github.com/feedzai/biy-paper
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information (2025)
- Are LLMs ready to help non-expert users to make charts of official statistics data? (2025)
- Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions (2025)
- Meet Your New Client: Writing Reports for AI - Benchmarking Information Loss in Market Research Deliverables (2025)
- Can we Evaluate RAGs with Synthetic Data? (2025)
- Is this chart lying to me? Automating the detection of misleading visualizations (2025)
- Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs (2025)