Papers
arxiv:2506.07047

Mathesis: Towards Formal Theorem Proving from Natural Languages

Published on Jun 8
· Submitted by Jianyuan1 on Jun 11
Authors:
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,

Abstract

Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China's national college entrance exam. Our approach is carefully designed, with a thorough study of each component. Experiments demonstrate Mathesis's effectiveness, with the autoformalizer outperforming the best baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a state-of-the-art 18% on Gaokao-Formal.

Community

Paper author Paper submitter

This paper introduces Mathesis, the first end-to-end theorem-proving pipeline designed to solve complex mathematical problems directly from informal, natural language statements. The system features the Mathesis-Autoformalizer, which uses reinforcement learning to translate natural language problems into the Lean 4 formal language. The pipeline's performance is tested on the newly introduced Gaokao-Formal benchmark, a challenging set of 488 problems from China's national college entrance exam. The paper also presents LeanScorer, a novel framework for evaluating the semantic accuracy of these formalizations. When combined, the Mathesis components achieve state-of-the-art results, demonstrating a significant advancement in automated formal reasoning.

Here are the key highlights:

  • Pioneering End-to-End System: Mathesis is the first theorem-proving pipeline that processes informal problem statements from start to finish.
  • Advanced Autoformalizer: The Mathesis-Autoformalizer is the first of its kind to use reinforcement learning and Hierarchical Preference Optimization (HPO) to improve its ability to formalize natural language problems. It outperforms the best baseline by 22% in pass-rate on the Gaokao-Formal benchmark.
  • New Challenging Benchmark: The paper introduces Gaokao-Formal, a benchmark with 488 complex problems from China's Gaokao exam, aimed at advancing research on formalizing diverse and intricate mathematical problems.
  • Nuanced Evaluation Framework: The authors developed LeanScorer, a novel evaluation method that combines LLM-based analysis with a Sugeno Fuzzy Integral to assess formalization quality. It achieves a 0.92 F1 score, outperforming previous methods.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.07047 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.07047 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.07047 in a Space README.md to link it from this page.

Collections including this paper 1