arxiv:2508.18124

CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Published on Aug 25 · Submitted by weidawang on Aug 27
#3 Paper of the day
Authors:
Wei Ma et al.
Abstract

CMPhysBench evaluates LLMs in condensed matter physics using calculation problems and a new SEED score for partial credit assessment, revealing significant capability gaps.

AI-generated summary

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between a prediction and the ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap in this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
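
The paper defines SEED precisely; as a rough illustration of the core idea (a tree edit distance over parsed expression trees, normalized into partial credit), here is a minimal sketch. The sympy-based parsing, the unit cost model, and the `seed_score` normalization are all assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a SEED-style partial-credit score (illustrative only;
# the cost model and normalization below are assumptions, not the paper's).
import sympy as sp

def tree(expr):
    """Turn a sympy expression into a (label, children) tuple tree."""
    if expr.args:
        return (type(expr).__name__, tuple(tree(a) for a in expr.args))
    return (str(expr), ())  # atoms: symbols and numbers

def size(t):
    """Number of nodes in a tree."""
    return 1 + sum(size(c) for c in t[1])

def tdist(a, b):
    """Tree edit distance: relabel cost at the root plus a child alignment."""
    relabel = 0 if a[0] == b[0] else 1
    return relabel + forest_dist(a[1], b[1])

def forest_dist(xs, ys):
    """Levenshtein over child sequences; substitution recurses into subtrees."""
    m, n = len(xs), len(ys)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + size(xs[i - 1])   # delete a whole subtree
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + size(ys[j - 1])   # insert a whole subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + size(xs[i - 1]),
                          D[i][j - 1] + size(ys[j - 1]),
                          D[i - 1][j - 1] + tdist(xs[i - 1], ys[j - 1]))
    return D[m][n]

def seed_score(pred, gold):
    """Map the distance to 0-100 partial credit (100 = structurally identical)."""
    tp, tg = tree(sp.sympify(pred)), tree(sp.sympify(gold))
    return 100.0 * max(0.0, 1 - tdist(tp, tg) / max(size(tp), size(tg)))

print(seed_score("hbar*omega/2", "hbar*omega/2"))  # 100.0: exact match
print(seed_score("hbar*omega", "hbar*omega/2"))    # 75.0: missing the 1/2
print(seed_score("k*T", "hbar*omega/2"))           # 25.0: mostly wrong
```

Because the distance is normalized by expression size, a near miss such as dropping a factor of 1/2 loses only part of the credit rather than the whole mark, which is exactly the non-binary grading the abstract describes.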

Community

Paper author · Paper submitter

🚀 Can Large Language Models Pass Grad-Level Condensed Matter Physics?

We just released CMPhysBench, a brand-new open-source benchmark!

Highlights:

  • 🔬 520 graduate-level problems in condensed matter physics — spanning magnetism, superconductivity, semiconductors, and strongly correlated systems
  • 📖 Curated from 17 authoritative textbooks, written & reviewed by PhD students and postdocs
  • 🧮 Introducing SEED (Scalable Expression Edit Distance), a smarter metric that gives partial credit for “almost correct” answers across different answer types, instead of simple right-or-wrong grading
  • 🤖 Tested on 18 major LLMs (GPT-4o, Claude 3.7, Gemini, Grok, LLaMA, Qwen, DeepSeek...): even the best model, Grok-4, reached only 28% accuracy! (A toy comparison of accuracy vs. SEED follows this list.)
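
As a toy illustration of why average SEED and exact-match accuracy diverge (the numbers below are made up, not results from the paper): near-miss answers earn partial credit under SEED but count as zero under accuracy.

```python
# Hypothetical per-question SEED scores (0-100); not data from the paper.
scores = [100, 75, 40, 0, 0]
accuracy = sum(s == 100 for s in scores) / len(scores)  # exact match only
avg_seed = sum(scores) / len(scores)                    # partial credit
print(f"accuracy = {accuracy:.0%}, average SEED = {avg_seed:.0f}")
# accuracy = 20%, average SEED = 43
```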

🔥 Takeaway:
LLMs are great at math, but when it comes to hardcore scientific reasoning in condensed matter physics, there’s still a huge gap.
That’s why we built CMPhysBench — to push AI forward in real scientific domains.

📂 Dataset & code are open-source 👉 [ https://github.com/CMPhysBench/CMPhysBench ]
Join us in exploring the next frontier of AI for Condensed Matter Physics!

