Abstract
InnoGym is a benchmark and framework that evaluates the innovation potential of AI agents using performance gain and novelty metrics, highlighting a gap between creativity and effectiveness.
LLMs and agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
Community
We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches.
The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection.
In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations.
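To make the two evaluation axes concrete, here is a minimal, assumption-based sketch in Python. The paper's exact formulas are not given on this page, so the function names (`performance_gain`, `novelty`), the relative-improvement normalization, and the embedding-similarity proxy for methodological novelty are illustrative choices, not the authors' implementation.

```python
# A minimal, assumption-based sketch of InnoGym-style scoring. The exact
# formulas are not specified on this page; everything below is illustrative.

import numpy as np


def performance_gain(agent_score: float, best_known_score: float,
                     higher_is_better: bool = True) -> float:
    """Relative improvement of the agent's solution over the best-known one."""
    if higher_is_better:
        return (agent_score - best_known_score) / abs(best_known_score)
    return (best_known_score - agent_score) / abs(best_known_score)


def novelty(agent_method_emb: np.ndarray,
            prior_method_embs: list[np.ndarray]) -> float:
    """A crude proxy for methodological novelty: distance of the agent's
    method description from its nearest prior approach in embedding space."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    nearest_similarity = max(cos(agent_method_emb, p) for p in prior_method_embs)
    # Farther from every prior method means a more novel approach.
    return 1.0 - nearest_similarity
```

The actual benchmark presumably judges novelty over full method descriptions collected per task rather than a single embedding comparison; the sketch is only meant to show how performance gain and novelty measure different things, which is why an agent can score high on one and low on the other.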
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth (2025)
- U2F: Encouraging SWE-Agent to Seize Novelty without Losing Feasibility (2025)
- MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline (2025)
- SelfAI: Building a Self-Training AI System with LLM Agents (2025)
- FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration (2025)
- CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions (2025)
- The FM Agent (2025)