haoyuzhao123/LeanIneqComp

An inequality benchmark for theorem proving.

1. Introduction

We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers, including Goedel, STP, and Kimina-7B, struggle significantly. DeepSeek-Prover-V2 is comparatively robust, possibly because it is trained to decompose problems into sub-problems, but it still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the weakness lies in compositional reasoning itself. Our results expose a persistent gap between the generalization behavior of current AI provers and human mathematical intuition.
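To make the compositional transformations concrete, here is a Lean 4 / Mathlib sketch of the flavor of the construction: a two-variable AM-GM seed and a composed variant obtained by multiplying two instances of it. The statements and names below are illustrative only, not items from the benchmark.

```lean
import Mathlib

-- Seed: two-variable AM-GM, a * b ≤ ((a + b) / 2) ^ 2.
theorem amgm_seed (a b : ℝ) : a * b ≤ ((a + b) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Composed variant: multiply two instances of the seed together.
theorem amgm_composed (a b c d : ℝ) (hc : 0 ≤ c) (hd : 0 ≤ d) :
    (a * b) * (c * d) ≤ ((a + b) / 2) ^ 2 * ((c + d) / 2) ^ 2 :=
  mul_le_mul (amgm_seed a b) (amgm_seed c d) (mul_nonneg hc hd) (sq_nonneg _)
```

A human immediately reuses the seed twice; the benchmark probes whether provers can do the same once the statement has been composed.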

2. Environment Setup

Lean 4 Environment

The Lean 4 environment and the corresponding Mathlib version used in this project follow DeepSeek-Prover-V1.5. Please first install the correct Lean 4 and Mathlib versions by following its environment setup guide.
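As a quick sanity check (this is not part of the official setup guide), you can place a small file like the one below inside the Mathlib project and compile it, for example with `lake env lean`, to confirm the pinned toolchain and Mathlib are working:

```lean
import Mathlib

-- A tiny inequality in the benchmark's style; if this file compiles,
-- the Lean 4 toolchain and Mathlib are set up correctly.
example (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b)]
```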

Copy Data and Testing Scripts

After installing the corresponding Lean 4 environment, please copy the benchmark and scripts_eval folders to the parent folder where you build your Mathlib. You should end up with the following file structure (only the important folders are shown):

parent_folder/
├── benchmark/
├── configs/
├── mathlib4/
├── prover/
└── scripts_eval/
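If you prefer to script the copy step, here is a minimal Python sketch; `LeanIneqComp` and `parent_folder` below are placeholder paths for wherever you cloned this repository and wherever your Mathlib build lives:

```python
import shutil
from pathlib import Path

# Placeholder paths: adjust to your clone of this repository and to the
# parent folder that contains mathlib4/.
repo = Path("LeanIneqComp")
parent = Path("parent_folder")

for name in ("benchmark", "scripts_eval"):
    src, dst = repo / name, parent / name
    # dirs_exist_ok lets the copy be re-run without failing on existing folders.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    print(f"copied {src} -> {dst}")
```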

3. Quick Start

General-Purpose Models

Whole-Proof Generation Methods

In-Context Learning Experiments

DeepSeek-Prover-V1.5-RL+RMaxTS

InternLM2.5-StepProver+BF

4. Evaluation Results

General-Purpose Models

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B (w/o thinking) | 32 | 48.2±1.9 | 3.5±3.3 | 16.2±3.0 | 28.0±3.3 | 17.0±3.2 | 15.0±3.0 | 41.4±3.7 | 13.4±4.5 | 15.4±4.4 |
| | 64 | 49.0±1.7 | 6.5±4.1 | 18.4±2.4 | 30.6±3.2 | 19.5±2.8 | 16.8±2.7 | 44.5±3.2 | 17.7±4.0 | 20.2±4.8 |
| | 128 | 49.9±2.0 | 10.6±4.2 | 20.0±2.5 | 32.6±2.9 | 21.8±3.2 | 19.0±2.6 | 47.4±3.1 | 21.1±3.7 | 25.4±4.2 |
| | 3200 | 52.0 | 40.0 | 36.0 | 44.0 | 32.0 | 28.0 | 52.0 | 36.0 | 36.0 |
| DeepSeek-R1-Distill-Qwen-32B (w/ thinking) | 32 | 48.8±1.6 | 10.9±3.8 | 21.1±3.1 | 42.9±2.5 | 27.0±3.4 | 18.4±2.4 | 50.5±2.3 | 18.9±4.6 | 22.0±4.0 |
| | 64 | 49.5±1.9 | 14.5±4.4 | 23.0±3.4 | 44.5±2.4 | 30.3±2.9 | 20.6±2.3 | 51.9±0.6 | 23.7±4.9 | 26.2±3.1 |
| | 128 | 50.9±2.1 | 19.2±4.1 | 26.1±4.3 | 46.2±2.3 | 32.6±2.7 | 22.1±2.0 | 52.0±0.0 | 28.0±3.9 | 29.4±2.7 |
| | 3200 | 60.0 | 44.0 | 44.0 | 56.0 | 40.0 | 24.0 | 52.0 | 36.0 | 40.0 |

Whole-Proof Generation Methods

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL | 32 | 48.1±3.0 | 0.0±0.4 | 8.2±1.5 | 14.9±3.2 | 2.9±1.8 | 4.4±1.4 | 40.2±2.8 | 12.4±1.1 | 12.2±2.5 |
| | 64 | 50.6±2.9 | 0.1±0.6 | 9.0±1.7 | 17.0±2.7 | 3.7±1.1 | 5.0±1.9 | 42.1±2.3 | 12.7±1.7 | 13.8±2.9 |
| | 128 | 52.2±2.1 | 0.2±0.8 | 9.8±2.0 | 18.7±2.7 | 4.0±0.0 | 6.1±2.3 | 43.2±1.6 | 13.3±2.2 | 16.2±2.9 |
| | 3200 | 60.0 | 4.0 | 24.0 | 24.0 | 4.0 | 12.0 | 44.0 | 20.0 | 28.0 |
| Goedel-Prover-SFT | 32 | 48.6±2.9 | 0.4±1.2 | 14.0±3.2 | 34.8±2.5 | 12.4±3.5 | 21.5±3.4 | 47.0±1.7 | 14.4±3.1 | 24.6±1.9 |
| | 64 | 50.6±2.6 | 0.8±1.6 | 16.6±2.8 | 36.2±1.9 | 15.8±3.4 | 24.6±2.9 | 47.8±0.9 | 16.6±2.5 | 25.5±1.9 |
| | 128 | 52.2±1.4 | 1.3±1.9 | 18.6±2.2 | 37.1±1.8 | 19.4±2.9 | 26.9±1.8 | 48.0±0.0 | 17.9±2.6 | 26.4±2.5 |
| | 3200 | 60.0 | 4.0 | 24.0 | 40.0 | 32.0 | 28.0 | 48.0 | 24.0 | 36.0 |
| STP (w/o miniF2F valid) | 32 | 59.1±1.9 | 14.3±4.4 | 23.2±4.5 | 35.2±2.4 | 14.6±2.7 | 16.0±2.6 | 55.6±1.3 | 12.6±5.0 | 27.6±3.6 |
| | 64 | 60.1±0.6 | 18.5±4.1 | 28.2±4.6 | 36.8±2.4 | 16.7±2.8 | 17.3±2.7 | 56.0±0.0 | 17.8±4.9 | 31.0±4.1 |
| | 128 | 60.3±1.1 | 24.3±4.1 | 33.0±3.6 | 37.9±2.6 | 18.4±3.0 | 18.9±3.3 | 56.0±0.0 | 24.0±4.4 | 33.9±4.1 |
| | 3200 | 64.0 | 44.0 | 40.0 | 44.0 | 24.0 | 28.0 | 56.0 | 36.0 | 40.0 |
| Kimina-Prover-Preview-Distill-7B | 32 | 59.4±4.1 | 11.7±5.4 | 45.2±3.7 | 46.9±4.5 | 27.0±2.6 | 27.7±3.3 | 44.2±1.3 | 18.1±3.9 | 35.8±2.0 |
| | 64 | 64.1±4.6 | 19.4±5.9 | 48.6±2.4 | 52.7±4.3 | 28.8±2.5 | 30.2±2.8 | 44.6±1.4 | 22.3±2.9 | 36.8±2.0 |
| | 128 | 69.4±4.2 | 28.2±5.4 | 50.6±2.2 | 57.6±3.6 | 30.4±3.0 | 32.0±1.6 | 45.1±1.8 | 25.6±2.5 | 37.6±2.5 |
| | 3200 | 80.0 | 44.0 | 64.0 | 68.0 | 52.0 | 36.0 | 52.0 | 32.0 | 44.0 |
| DeepSeek-Prover-V2-7B | 32 | 75.0±4.4 | 58.6±4.0 | 52.5±4.5 | 64.6±4.1 | 33.0±2.3 | 35.0±2.3 | 59.1±2.9 | 49.3±3.4 | 38.8±4.4 |
| | 64 | 80.7±5.3 | 62.1±4.5 | 57.4±4.0 | 68.3±3.1 | 34.7±2.7 | 36.6±2.3 | 61.7±2.5 | 51.6±2.9 | 43.7±4.2 |
| | 128 | 85.8±5.4 | 65.9±5.3 | 61.4±3.7 | 71.0±2.0 | 36.3±3.6 | 37.9±2.6 | 64.0±1.6 | 53.3±3.1 | 49.9±4.3 |
| | 3200 | 96.0 | 84.0 | 76.0 | 76.0 | 52.0 | 48.0 | 68.0 | 64.0 | 64.0 |

Tree-Search Methods

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL + RMaxTS | 1×3200 | 60.0±0.0 | 3.0±1.7 | 22.0±2.0 | 24.0±0.0 | 8.0±2.8 | 13.0±3.3 | 44.0±0.0 | 14.0±3.5 | 29.0±1.7 |
| | 2×3200 | 60.0±0.0 | 6.0±2.0 | 26.0±2.0 | 24.0±0.0 | 10.0±2.0 | 16.0±0.0 | 44.0±0.0 | 16.0±4.0 | 32.0±0.0 |
| | 4×3200 | 60.0 | 8.0 | 28.0 | 24.0 | 12.0 | 20.0 | 44.0 | 20.0 | 36.0 |
| InternLM2.5-StepProver + BF | 1×32×600 | 30.8±3.1 | 0.0±0.0 | 2.5±3.1 | 12.0±0.0 | 0.0±0.0 | 1.2±1.9 | 34.0±2.0 | 2.2±2.0 | 17.0±3.9 |
| | 4×32×600 | 38.0±4.5 | 0.0±0.0 | 9.0±3.3 | 12.0±0.0 | 0.0±0.0 | 3.0±1.7 | 36.0±0.0 | 5.0±1.7 | 21.0±1.7 |
| | 16×32×600 | 44.0 | 0.0 | 24.0 | 12.0 | 0.0 | 4.0 | 36.0 | 8.0 | 24.0 |

5. Citation

If you find our work helpful, please consider starring ⭐ us and citing our paper:

