DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning π
Welcome to the GitHub repository for DeepTheorem π, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems π.
Overview π
Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs π§ . However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing:
A Large-Scale Theorem Proving Dataset π:
- Contains high-quality, IMO-level informal theorems and proofs across diverse mathematical domains π.
- Rigorously annotated for correctness, difficulty, and topic categories β .
- Includes systematically constructed verifiable theorem variants for robust evaluation π.
RL-Zero π€:
- A novel reinforcement learning strategy tailored for informal theorem proving βοΈ.
- Utilizes verified theorem variants to incentivize robust mathematical inference π‘.
Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1)
Performance π
Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models.
Model | FIMO | HMMT | Putnam | Avg.(#Rank) | ||||
---|---|---|---|---|---|---|---|---|
out. | proc. | out. | proc. | out. | proc. | out. | proc. | |
Gemini2.5-Pro | 57.14 | 54.06 | 57.63 | 49.82 | 64.58 | 58.75 | 59.78(#2) | 54.21(#3) |
o1-mini | 60.32 | 55.23 | 35.59 | 30.90 | 61.46 | 52.88 | 52.46(#4) | 46.34(#4) |
o1 | 66.67 | 61.00 | 47.46 | 47.30 | 62.50 | 57.55 | 58.88(#3) | 55.28(#2) |
o3-mini | 80.95 | 77.61 | 45.76 | 43.47 | 78.12 | 75.12 | 68.28(#1) | 65.40(#1) |
*DeepTheorem-RL-7B | 55.56 | 39.07 | 28.81 | 20.85 | 57.29 | 42.20 | 47.22(#5) | 34.04(#5) |
*DeepTheorem-RL-3B | 38.10 | 23.39 | 25.42 | 13.56 | 52.08 | 33.84 | 38.53 | 23.60 |
*DeepTheorem-RL-1.5B | 31.75 | 15.23 | 23.73 | 10.15 | 52.08 | 22.79 | 35.85 | 16.06 |
DeepTheorem Dataset π
The DeepTheorem dataset comprises 121K IMO-level informal theorems and proofs spanning diverse mathematical domains π. Each theorem-proof pair is rigorously annotated for:
- o3-mini Proofs ποΈ: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model β .
- Truth Value π: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false βοΈ.
- Difficulty ποΈ: Categorized by complexity to suit various LLM capabilities π§©.
- Topic Categories ποΈ: Covering algebra, geometry, number theory, and more π.
- Variants π: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem π.
- Downloads last month
- 12