Gaia2 Leaderboard Update: New Models and New Observations

Community Article · Published October 2, 2025

🤗 Leaderboard: Model rankings
🤗 Demo: Discover ARE environments and interact with Agents
📄 ArXiv paper: ARE: Scaling Up Agent Environments and Evaluations
💻 Code: ARE GitHub repo



Figure 1: Gaia2 Overall Scores - using pass@1 evaluation.

Release and feedback

The Gaia2 release has been warmly received by the community over the past week. There is broad agreement that more realistic and challenging benchmarks are needed to advance agent research, similar to what we've seen with Claude 4.5 Sonnet's results on Tau-Bench. We also launched an open demo with a user interface that lets anyone try the environment hands-on, which has been useful for community engagement.

A common question has been about model coverage: why were certain models included in our initial evaluation while others were not? The answer is simple: thorough evaluation takes significant time and compute resources, and there's a new model every week! This blog post shares results from a new set of models. Looking ahead, we have additional models queued for evaluation, including Claude Sonnet 4.5, GLM, and other model sizes.



Analysis of new results

The latest round of evaluations shows that Claude 4 Sonnet Extended Thinking continues to trail GPT-5 (high), but it demonstrates consistent improvements over the standard Claude 4 Sonnet. OpenAI and Anthropic remain the overall leaders on the benchmark. We also extended our coverage of open-source models, adding DeepSeek V3.1 Terminus, Qwen3 235B in thinking mode, and GPT-OSS 120B in high reasoning mode.

Among these, DeepSeek V3.1 Terminus delivers a noticeable boost over DeepSeek V3.1. This pushes it ahead of Kimi-K2 and narrows the gap with Gemini-2-5-Pro. Qwen3 235B with reasoning enabled sees a +4 point gain, comparable to the improvement that reasoning provides to Claude 4 Sonnet. This confirms that explicit reasoning is important for good agentic behavior, although we expected a bigger jump on tasks involving ambiguity. Meanwhile, closed-source frontier models continue to dominate on search-heavy tasks. GPT-OSS 120B shows a very good performance-to-active-parameter ratio, though its scores are somewhat underwhelming relative to what it achieves on other industry benchmarks.

For readability, we only highlight top-line results in this update. The full set of detailed results is available on the leaderboard.


Figure 2: Gaia2 scores per capability split. Models are reranked independently for each capability, highlighting where they excel or struggle.


Cool observations

These new evaluations reveal some interesting insights about performance, cost, and efficiency. One surprising element that a few people asked about is that Claude is overall more expensive to run than GPT-5 (high). Even though it generates fewer tokens per step, it tends to perform more steps overall. Since input tokens dominate cost (traces are roughly 100-200K tokens long), those extra steps make Claude slightly more expensive end to end than GPT-5 (high).
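As a rough back-of-the-envelope sketch of why step count matters so much (all prices and token counts below are illustrative assumptions, not the values measured for the leaderboard): in a ReAct-style loop the growing trace is re-sent as context at every step, so the number of steps drives input cost far more than the tokens generated per step.

```python
# Back-of-the-envelope cost split for one agent scenario.
# All prices and token counts are illustrative assumptions only.

def scenario_cost(steps, out_tokens_per_step, tool_tokens_per_step,
                  base_prompt_tokens, usd_per_m_in, usd_per_m_out):
    """Approximate cost of a ReAct-style trace in which the growing
    context is re-sent as input at every step."""
    trace = base_prompt_tokens
    input_tokens = 0
    for _ in range(steps):
        input_tokens += trace                           # full trace sent as context
        trace += out_tokens_per_step + tool_tokens_per_step  # output + tool result appended
    output_tokens = steps * out_tokens_per_step
    in_cost = input_tokens * usd_per_m_in / 1e6
    out_cost = output_tokens * usd_per_m_out / 1e6
    return in_cost, out_cost, trace

# Fewer output tokens per step but many steps: input cost dominates.
in_cost, out_cost, final_trace = scenario_cost(
    steps=50, out_tokens_per_step=400, tool_tokens_per_step=2_000,
    base_prompt_tokens=5_000, usd_per_m_in=3.0, usd_per_m_out=15.0)
print(f"input ${in_cost:.2f} vs output ${out_cost:.2f}, final trace ~{final_trace} tokens")
```

With these made-up numbers the trace ends around 125K tokens and the input side of the bill is more than an order of magnitude larger than the output side, which is why adding steps is the expensive part.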


Figure 3: Left: Gaia2 score vs average scenario cost in USD. Right: time taken per model to successfully solve Gaia2 scenarios, compared to humans.


Figure 4: Left: Gaia2 pass@1 versus average model calls per scenario. Model performance is highly correlated with the number of tool calls, emphasizing the importance of exploration. Right: Gaia2 pass@1 score versus average output tokens per scenario (log scale).

Quality over Quantity – We also observe that enabling “thinking” can make models better without necessarily increasing costs. For both Qwen and Claude Sonnet, reasoning improves accuracy and can reduce overall cost and execution time. This results in what looks like an inverse scaling law relative to GPT-5: models that produce more tokens per step while reasoning require fewer steps overall, because they make more effective tool calls. In practice, Qwen 235B benefits strongly from this effect, while Claude achieves better performance at roughly the same cost and wall-clock time.
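To make the token arithmetic behind this concrete, here is a hypothetical comparison of two agents that end with a similar final trace length of roughly 150K tokens; the step counts and per-step token sizes are invented purely for illustration, not taken from our evaluations.

```python
# Illustrative only: cumulative input tokens for two hypothetical agents
# that finish with a similar final trace length.
def cumulative_input_tokens(steps, tokens_added_per_step, base=5_000):
    trace, total_in = base, 0
    for _ in range(steps):
        total_in += trace                # trace re-sent as context at each step
        trace += tokens_added_per_step   # model output + tool result appended
    return total_in, trace

concise = cumulative_input_tokens(steps=60, tokens_added_per_step=2_500)
reasoning = cumulative_input_tokens(steps=30, tokens_added_per_step=5_000)
print(concise)    # many short steps  -> ~4.7M cumulative input tokens
print(reasoning)  # fewer, longer steps -> ~2.3M cumulative input tokens
```

Both hypothetical agents end with a ~155K-token trace, but halving the number of steps roughly halves the cumulative input tokens, which is the inverse-scaling effect described above.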


Figure 5: Gaia2 budget scaling curve: for each max_budget, we plot the fraction of scenarios solved within budget, i.e. the mean of 1{scenario_result = True ∧ scenario_cost < max_budget}. Equipped with a simple ReAct-like scaffold (see Section 2.4 of the paper), no model evaluated here dominates across the intelligence spectrum; each trades off capability, efficiency, and budget. At equal cost, some models fare better, yet all curves plateau, suggesting that standard scaffolds and/or models miss key ingredients for sustained progress. Cost estimates from Artificial Analysis model pricing data (accessed September 30, 2025).
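For concreteness, here is a minimal sketch of how such a budget scaling curve can be computed from per-scenario pass/fail results and costs; the variable names and toy numbers are assumptions for illustration, not the ARE output format.

```python
import numpy as np

def budget_scaling_curve(results, costs, max_budgets):
    """Fraction of scenarios solved within each budget:
    mean over scenarios of 1{result is True and cost < max_budget}."""
    results = np.asarray(results, dtype=bool)
    costs = np.asarray(costs, dtype=float)
    return [float(np.mean(results & (costs < b))) for b in max_budgets]

# Toy example: 5 scenarios with pass/fail outcomes and per-scenario USD costs.
solved = [True, True, False, True, False]
cost_usd = [0.8, 2.5, 1.0, 6.0, 3.0]
budgets = [1, 2, 4, 8]
print(budget_scaling_curve(solved, cost_usd, budgets))
# -> [0.2, 0.2, 0.4, 0.6]
```

Plotting these fractions against the budgets yields one curve per model, which is how the plateaus in Figure 5 become visible.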

Small Budget Models – Another important observation comes from budget scaling. With new OSS models added, we see new contenders at lower budget tiers. GPT-OSS 120B in high reasoning mode leads the pack when resources are extremely limited, while DeepSeek V3.1 Terminus slightly edges out Kimi-K2. Depending on the use case, different models may be preferable: for example, in search-heavy scenarios, Gemini-2-5-Pro remains significantly stronger than both DeepSeek and Kimi-K2.


Next steps

This update represents only a small part of our ongoing evaluation efforts with Gaia2 and ARE. We know that the community is eager to see results from other frontier agentic models, and we are doing everything we can to make those numbers available quickly.
