DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning πŸš€

Welcome to the GitHub repository for DeepTheorem πŸŽ‰, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems 🌟.

Overview πŸ“–

Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs 🧠. However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing:

  1. A Large-Scale Theorem Proving Dataset πŸ“Š:

    • Contains high-quality, IMO-level informal theorems and proofs across diverse mathematical domains πŸ“š.
    • Rigorously annotated for correctness, difficulty, and topic categories βœ….
    • Includes systematically constructed verifiable theorem variants for robust evaluation πŸ”.
  2. RL-Zero πŸ€–:

    • A novel reinforcement learning strategy tailored for informal theorem proving βš™οΈ.
    • Utilizes verified theorem variants to incentivize robust mathematical inference πŸ’‘.

Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1)

Performance πŸš€

Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models.

Model FIMO HMMT Putnam Avg.(#Rank)
out. proc. out. proc. out. proc. out. proc.
Gemini2.5-Pro 57.14 54.06 57.63 49.82 64.58 58.75 59.78(#2) 54.21(#3)
o1-mini 60.32 55.23 35.59 30.90 61.46 52.88 52.46(#4) 46.34(#4)
o1 66.67 61.00 47.46 47.30 62.50 57.55 58.88(#3) 55.28(#2)
o3-mini 80.95 77.61 45.76 43.47 78.12 75.12 68.28(#1) 65.40(#1)
*DeepTheorem-RL-7B 55.56 39.07 28.81 20.85 57.29 42.20 47.22(#5) 34.04(#5)
*DeepTheorem-RL-3B 38.10 23.39 25.42 13.56 52.08 33.84 38.53 23.60
*DeepTheorem-RL-1.5B 31.75 15.23 23.73 10.15 52.08 22.79 35.85 16.06

DeepTheorem Dataset πŸ“Š

The DeepTheorem dataset comprises 121K IMO-level informal theorems and proofs spanning diverse mathematical domains 🌐. Each theorem-proof pair is rigorously annotated for:

  • o3-mini Proofs πŸ–‹οΈ: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model βœ….
  • Truth Value πŸ”: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false βœ”οΈ.
  • Difficulty 🎚️: Categorized by complexity to suit various LLM capabilities 🧩.
  • Topic Categories πŸ—‚οΈ: Covering algebra, geometry, number theory, and more πŸ“˜.
  • Variants πŸ”„: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem πŸ”€.
Downloads last month
24
Safetensors
Model size
1.78B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Jiahao004/DeepTheorem-qwen-1.5b-rl

Quantizations
2 models

Collection including Jiahao004/DeepTheorem-qwen-1.5b-rl