Major PHYBench Update Released

Community Article Published May 25, 2025

The PHYBench project has undergone a significant update, introducing comprehensive upgrades in both platform functionality and experimental research design. This release aims to further advance the evaluation and understanding of physical reasoning capabilities in AI models.

(1) New Platform Launch

We have officially released a new interactive website at https://www.phybench.cn/, featuring:

  • A visualized leaderboard of 20 mainstream models evaluated on PHYBench, reporting both Accuracy and EED Score, with detailed breakdowns across physics subfields.

  • An event timeline module that documents the key milestones in the development of PHYBench, so users can follow the evolution of the dataset and its evaluation framework.

(2) Experimental Enhancements and Paper Reorganization

We restructured the paper and added key experiments to further demonstrate PHYBench’s robustness and significance as a high-quality benchmark.

View the new version of the paper at https://arxiv.org/abs/2504.16074v2

Evaluation Quality Verification

  • Solving PHYBench problems consumes far more tokens than solving problems from existing benchmarks, including competition-level datasets, which highlights their greater complexity.

  • Model scores on PHYBench are generally lower and exhibit a more distinguishable distribution, making it easier to differentiate between varying reasoning capabilities.

  • Test-time scaling experiments show consistent upward trends across models as the number of samples increases. Model rankings are preserved and scores scale robustly, which further validates PHYBench as a reliable evaluation benchmark.
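To make the test-time scaling check concrete, here is a minimal sketch. It assumes the metric is pass@k, computed with the standard unbiased estimator; this summary does not say which sampling metric the paper uses. The model names and correct-answer counts are hypothetical and serve only to show what "upward trend" and "order-preserving" mean in practice.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given c correct answers among n drawn samples."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when fewer than k samples are wrong
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers of correct answers out of 16 samples on one problem.
n_samples = 16
correct_counts = {"model_a": 6, "model_b": 2}
ks = (1, 2, 4, 8, 16)

curves = {
    name: [pass_at_k(n_samples, c, k) for k in ks]
    for name, c in correct_counts.items()
}

def order_preserved(stronger, weaker):
    """True if the stronger model stays at or above the weaker one for every k."""
    return all(s >= w for s, w in zip(stronger, weaker))

for name, curve in curves.items():
    print(name, [round(p, 3) for p in curve])
print("order preserved:", order_preserved(curves["model_a"], curves["model_b"]))
```

In a real evaluation the curves would be averaged over all problems before comparing models. The single-problem version is kept here only for brevity.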

Error Localization in Model Reasoning

Our analysis reveals that current models are competent at both ends of the problem-solving pipeline: they can understand the problem statement and carry out symbolic manipulations on given equations. However, they struggle with the intermediate step of applying physical laws to construct new equations.

This issue primarily stems from insufficient semantic reasoning capabilities, i.e., models often fail to fully grasp the meaning and applicability of physical laws, leading to frequent misuse of formulas.

Reasoning Pattern Analysis: Superficial Reasoning

We define Superficial Reasoning as model behavior where answers are derived through pattern matching (e.g., recalling specific intermediate conclusions or solution steps) rather than genuine understanding of physical principles.

To investigate this, we designed a systematic perturbation experiment. By injecting targeted errors into otherwise correct solution chains (e.g., modifying physical laws, tampering with semantic analysis, or altering equations), we evaluated the models' robustness and error correction capabilities.
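The perturbation setup can be sketched roughly as follows. This is an illustrative outline, not the authors' pipeline. The `Step` type, the step labels, the prompt wording, and the string-based answer comparison are all assumptions made for the example; a real harness would compare answers symbolically.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Step:
    kind: str  # hypothetical labels: "law", "semantic", "equation"
    text: str

def perturb(chain, index, wrong_text):
    """Return a copy of the chain with step `index` replaced by an erroneous one."""
    if not 0 <= index < len(chain):
        raise IndexError("step index out of range")
    wrong_step = replace(chain[index], text=wrong_text)
    return chain[:index] + [wrong_step] + chain[index + 1:]

def build_prompt(problem, chain):
    """Present the (possibly perturbed) partial solution and ask the model to continue."""
    steps = "\n".join(f"Step {i + 1}: {s.text}" for i, s in enumerate(chain))
    return f"{problem}\n\nPartial solution:\n{steps}\n\nContinue and give the final answer."

def classify(model_answer, correct_answer, perturbed_answer):
    """Label the outcome; exact string comparison is a simplification."""
    if model_answer == correct_answer:
        return "recovered"
    if model_answer == perturbed_answer:
        return "followed_error"
    return "other_error"

problem = ("A block slides from rest down a frictionless incline of height h. "
           "Find its speed v at the bottom.")
chain = [
    Step("semantic", "The incline is frictionless, so mechanical energy is conserved."),
    Step("law", "Energy conservation: m*g*h = (1/2)*m*v**2"),
]

# Inject an error into the physical-law step (the factor 1/2 is dropped).
perturbed = perturb(chain, 1, "Energy conservation: m*g*h = m*v**2")
prompt = build_prompt(problem, perturbed)

# A model that blindly continues the perturbed chain would answer sqrt(g*h).
print(classify("sqrt(g*h)", "sqrt(2*g*h)", "sqrt(g*h)"))
```

The share of "recovered" versus "followed_error" outcomes over many problems and perturbation types would then indicate how robust a model is to each kind of injected error.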

Based on the results, we categorize model reasoning behavior into three types:

  • Superficial Reasoning: The model follows the perturbed reasoning chain without correction, unable to detect or recover from errors. This pattern is typical of non-reasoning models (e.g., GPT-4o, DeepSeek-V3) and early-stage reasoning models (e.g., o1-preview).

  • Pseudo-genuine Reasoning: The model exhibits partial robustness by employing specific detection heuristics. For instance, DeepSeek-R1 performs dimensional analysis and divergence checks on physical quantities, which stabilizes its responses at the equation level, but it remains fragile with respect to semantic reasoning. Gemini 2.5 Pro avoids semantic reasoning altogether by relying on extensive formal derivations and large-scale equation systems. This yields high robustness, but the resulting reasoning lacks semantic interpretability.

  • Genuine Reasoning (aspirational direction): The model is capable of reflecting on and correcting errors based on physical understanding, showing stronger and more consistent reasoning performance under perturbations.
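As an illustration of the kind of equation-level heuristic mentioned above, the following sketch shows a simple dimensional-consistency check. It assumes quantities are represented as hypothetical (mass, length, time) exponent tuples, and it makes no claim about how any particular model performs such checks internally.

```python
from fractions import Fraction

# Dimensions as (mass, length, time) exponents; an assumed toy representation.
LENGTH = (0, 1, 0)
VELOCITY = (0, 1, -1)
ACCELERATION = (0, 1, -2)

def mul(a, b):
    """Dimension of a product: exponents add."""
    return tuple(x + y for x, y in zip(a, b))

def power(a, p):
    """Dimension of a quantity raised to the power p: exponents scale."""
    return tuple(x * Fraction(p) for x in a)

def same_dimension(lhs, rhs):
    """True if both sides of an equation carry the same dimension."""
    return tuple(lhs) == tuple(rhs)

# v = sqrt(2*g*h): numeric factors are dimensionless, so only g*h matters.
rhs_correct = power(mul(ACCELERATION, LENGTH), Fraction(1, 2))
# v = g*h: a tampered equation with the square root removed.
rhs_tampered = mul(ACCELERATION, LENGTH)

print(same_dimension(VELOCITY, rhs_correct))   # dimensionally consistent
print(same_dimension(VELOCITY, rhs_tampered))  # dimensionally inconsistent
```

A check like this flags the tampered equation v = g*h, but it cannot tell v = sqrt(2*g*h) from v = sqrt(g*h), since both have the dimension of a velocity. That gap is one way to see why equation-level heuristics alone do not protect against errors in applying a physical law.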


We will continue to advance PHYBench in the directions of benchmark methodology, reasoning behavior characterization, and in-depth model capability analysis. We welcome feedback and participation from researchers and practitioners. The website and evaluation results will be updated regularly, and we look forward to your insights and collaboration.
