CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Abstract
CMPhysBench evaluates LLMs in condensed matter physics using calculation problems and a new SEED score for partial credit assessment, revealing significant capability gaps.
We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap in this practical, frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
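The abstract does not spell out how SEED is computed, but the idea of tree-based partial credit can be sketched in a few lines. The snippet below is an illustrative approximation, not the authors' implementation: it assumes SymPy for parsing answers into expression trees and the `zss` package for Zhang-Shasha tree edit distance, and the normalization in `seed_score` (edit distance scaled by the larger tree's size onto a 0–100 range) is an assumed stand-in for the paper's actual scaling.

```python
# Illustrative sketch of a SEED-style score (not the paper's exact formulation).
# Assumes: sympy for expression parsing, zss for Zhang-Shasha tree edit distance.
#   pip install sympy zss
import sympy as sp
from zss import Node, simple_distance


def to_tree(expr):
    """Convert a SymPy expression into a zss tree (operators as internal nodes)."""
    if not expr.args:                       # leaf: symbol or number
        return Node(str(expr))
    node = Node(expr.func.__name__)         # e.g. Add, Mul, Pow
    for arg in expr.args:
        node.addkid(to_tree(arg))
    return node


def tree_size(expr):
    """Number of nodes in the expression tree."""
    return 1 + sum(tree_size(arg) for arg in expr.args)


def seed_score(prediction: str, ground_truth: str) -> float:
    """Partial-credit score in [0, 100]: 100 for identical trees,
    decreasing with the tree edit distance between prediction and ground truth."""
    try:
        pred, gold = sp.sympify(prediction), sp.sympify(ground_truth)
    except (sp.SympifyError, SyntaxError):
        return 0.0                          # unparsable prediction gets no credit
    dist = simple_distance(to_tree(pred), to_tree(gold))
    scale = max(tree_size(pred), tree_size(gold))
    return 100.0 * max(0.0, 1.0 - dist / scale)


# Identical answers score 100; a single wrong coefficient still earns partial credit.
print(seed_score("h**2*k**2/(2*m)", "h**2*k**2/(2*m)"))   # 100.0
print(seed_score("h**2*k**2/(4*m)", "h**2*k**2/(2*m)"))   # high partial credit
print(seed_score("sin(x)", "h**2*k**2/(2*m)"))            # little overlap, low score
```

In the actual benchmark the scoring also has to handle answer types beyond symbolic expressions (e.g., numerical values and equations), which this sketch omits.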
Community
🚀 Can Large Language Models Pass Grad-Level Condensed Matter Physics?
We just released CMPhysBench, a brand-new open-source benchmark!
✨ Highlights:
- 🔬 520 graduate-level problems in condensed matter physics — spanning magnetism, superconductivity, semiconductors, and strongly correlated systems
- 📖 Curated from 17 authoritative textbooks; problems written & reviewed by PhD students and postdocs
- 🧮 Introducing SEED (Scalable Expression Edit Distance) — a smarter metric that gives partial credit for “almost correct” answers across diverse answer types, instead of simple right-or-wrong grading
- 🤖 Tested on 18 major LLMs (GPT-4o, Claude 3.7, Gemini, Grok, LLaMA, Qwen, DeepSeek...) — and the best model, Grok-4, reached only 28% accuracy!
🔥 Takeaway:
LLMs are great at math, but when it comes to hardcore scientific reasoning in condensed matter physics, there’s still a huge gap.
That’s why we built CMPhysBench — to push AI forward in real scientific domains.
📂 Dataset & code are open-source 👉 https://github.com/CMPhysBench/CMPhysBench
Join us in exploring the next frontier of AI for Condensed Matter Physics!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems (2025)
- QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry (2025)
- Towards a Large Physics Benchmark (2025)
- CEQuest: Benchmarking Large Language Models for Construction Estimation (2025)
- Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes (2025)
- UQ: Assessing Language Models on Unsolved Questions (2025)
- A Large Language Model for Chemistry and Retrosynthesis Predictions (2025)