We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers, including Goedel, STP, and Kimina-7B, struggle significantly. DeepSeek-Prover-V2 shows relative robustness (possibly because it is trained to decompose problems into sub-problems) but still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, which pins the weakness on compositional reasoning itself. Our results expose a persistent gap between the generalization behavior of current AI provers and human mathematical intuition.
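To make the transformations concrete, here is a minimal Lean 4 sketch (requiring Mathlib) of the variable-duplication idea. The theorem names and statements are illustrative stand-ins, not actual Ineq-Comp entries:

```lean
import Mathlib

-- Seed: a two-variable AM-GM-style inequality, easy for both
-- humans and provers.
theorem amgm_seed (a b : ℝ) (ha : 0 ≤ a) (hb : 0 ≤ b) :
    a * b ≤ ((a + b) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Variable duplication: restate the seed over a fresh copy of the
-- variables and combine the two instances. A human simply reuses the
-- seed argument twice; provers often treat this as a new problem.
theorem amgm_dup (a b c d : ℝ) (ha : 0 ≤ a) (hb : 0 ≤ b)
    (hc : 0 ≤ c) (hd : 0 ≤ d) :
    a * b + c * d ≤ ((a + b) / 2) ^ 2 + ((c + d) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a - b), sq_nonneg (c - d)]
```

A prover that has solved the seed has, in principle, everything it needs for the duplicated variant; the benchmark measures whether it can actually make that step.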
The Lean 4 environment and the corresponding Mathlib version used in this project follow DeepSeek-Prover-V1.5. Please first install the matching Lean 4 and Mathlib versions by following the environment setup guide.
After installing the Lean 4 environment, copy the benchmark and scripts_eval folders into the parent folder where you built Mathlib. You should end up with the following file structure (only the important folders are shown).
```
parent_folder/
├── benchmark/
├── configs/
├── mathlib4/
├── prover/
└── scripts_eval/
```
All numbers below are pass rates (%); paired entries are reported as mean±std. The first table covers the general-purpose reasoning model DeepSeek-R1-Distill-Qwen-32B with thinking mode disabled and enabled.

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B (w/o thinking) | 32 | 48.2±1.9 | 3.5±3.3 | 16.2±3.0 | 28.0±3.3 | 17.0±3.2 | 15.0±3.0 | 41.4±3.7 | 13.4±4.5 | 15.4±4.4 |
| | 64 | 49.0±1.7 | 6.5±4.1 | 18.4±2.4 | 30.6±3.2 | 19.5±2.8 | 16.8±2.7 | 44.5±3.2 | 17.7±4.0 | 20.2±4.8 |
| | 128 | 49.9±2.0 | 10.6±4.2 | 20.0±2.5 | 32.6±2.9 | 21.8±3.2 | 19.0±2.6 | 47.4±3.1 | 21.1±3.7 | 25.4±4.2 |
| | 3200 | 52.0 | 40.0 | 36.0 | 44.0 | 32.0 | 28.0 | 52.0 | 36.0 | 36.0 |
| DeepSeek-R1-Distill-Qwen-32B (w/ thinking) | 32 | 48.8±1.6 | 10.9±3.8 | 21.1±3.1 | 42.9±2.5 | 27.0±3.4 | 18.4±2.4 | 50.5±2.3 | 18.9±4.6 | 22.0±4.0 |
| | 64 | 49.5±1.9 | 14.5±4.4 | 23.0±3.4 | 44.5±2.4 | 30.3±2.9 | 20.6±2.3 | 51.9±0.6 | 23.7±4.9 | 26.2±3.1 |
| | 128 | 50.9±2.1 | 19.2±4.1 | 26.1±4.3 | 46.2±2.3 | 32.6±2.7 | 22.1±2.0 | 52.0±0.0 | 28.0±3.9 | 29.4±2.7 |
| | 3200 | 60.0 | 44.0 | 44.0 | 56.0 | 40.0 | 24.0 | 52.0 | 36.0 | 40.0 |
The next table covers dedicated whole-proof provers under the same protocol.

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL | 32 | 48.1±3.0 | 0.0±0.4 | 8.2±1.5 | 14.9±3.2 | 2.9±1.8 | 4.4±1.4 | 40.2±2.8 | 12.4±1.1 | 12.2±2.5 |
| | 64 | 50.6±2.9 | 0.1±0.6 | 9.0±1.7 | 17.0±2.7 | 3.7±1.1 | 5.0±1.9 | 42.1±2.3 | 12.7±1.7 | 13.8±2.9 |
| | 128 | 52.2±2.1 | 0.2±0.8 | 9.8±2.0 | 18.7±2.7 | 4.0±0.0 | 6.1±2.3 | 43.2±1.6 | 13.3±2.2 | 16.2±2.9 |
| | 3200 | 60.0 | 4.0 | 24.0 | 24.0 | 4.0 | 12.0 | 44.0 | 20.0 | 28.0 |
| Goedel-Prover-SFT | 32 | 48.6±2.9 | 0.4±1.2 | 14.0±3.2 | 34.8±2.5 | 12.4±3.5 | 21.5±3.4 | 47.0±1.7 | 14.4±3.1 | 24.6±1.9 |
| | 64 | 50.6±2.6 | 0.8±1.6 | 16.6±2.8 | 36.2±1.9 | 15.8±3.4 | 24.6±2.9 | 47.8±0.9 | 16.6±2.5 | 25.5±1.9 |
| | 128 | 52.2±1.4 | 1.3±1.9 | 18.6±2.2 | 37.1±1.8 | 19.4±2.9 | 26.9±1.8 | 48.0±0.0 | 17.9±2.6 | 26.4±2.5 |
| | 3200 | 60.0 | 4.0 | 24.0 | 40.0 | 32.0 | 28.0 | 48.0 | 24.0 | 36.0 |
| STP (w/o miniF2F valid) | 32 | 59.1±1.9 | 14.3±4.4 | 23.2±4.5 | 35.2±2.4 | 14.6±2.7 | 16.0±2.6 | 55.6±1.3 | 12.6±5.0 | 27.6±3.6 |
| | 64 | 60.1±0.6 | 18.5±4.1 | 28.2±4.6 | 36.8±2.4 | 16.7±2.8 | 17.3±2.7 | 56.0±0.0 | 17.8±4.9 | 31.0±4.1 |
| | 128 | 60.3±1.1 | 24.3±4.1 | 33.0±3.6 | 37.9±2.6 | 18.4±3.0 | 18.9±3.3 | 56.0±0.0 | 24.0±4.4 | 33.9±4.1 |
| | 3200 | 64.0 | 44.0 | 40.0 | 44.0 | 24.0 | 28.0 | 56.0 | 36.0 | 40.0 |
| Kimina-Prover-Preview-Distill-7B | 32 | 59.4±4.1 | 11.7±5.4 | 45.2±3.7 | 46.9±4.5 | 27.0±2.6 | 27.7±3.3 | 44.2±1.3 | 18.1±3.9 | 35.8±2.0 |
| | 64 | 64.1±4.6 | 19.4±5.9 | 48.6±2.4 | 52.7±4.3 | 28.8±2.5 | 30.2±2.8 | 44.6±1.4 | 22.3±2.9 | 36.8±2.0 |
| | 128 | 69.4±4.2 | 28.2±5.4 | 50.6±2.2 | 57.6±3.6 | 30.4±3.0 | 32.0±1.6 | 45.1±1.8 | 25.6±2.5 | 37.6±2.5 |
| | 3200 | 80.0 | 44.0 | 64.0 | 68.0 | 52.0 | 36.0 | 52.0 | 32.0 | 44.0 |
| DeepSeek-Prover-V2-7B | 32 | 75.0±4.4 | 58.6±4.0 | 52.5±4.5 | 64.6±4.1 | 33.0±2.3 | 35.0±2.3 | 59.1±2.9 | 49.3±3.4 | 38.8±4.4 |
| | 64 | 80.7±5.3 | 62.1±4.5 | 57.4±4.0 | 68.3±3.1 | 34.7±2.7 | 36.6±2.3 | 61.7±2.5 | 51.6±2.9 | 43.7±4.2 |
| | 128 | 85.8±5.4 | 65.9±5.3 | 61.4±3.7 | 71.0±2.0 | 36.3±3.6 | 37.9±2.6 | 64.0±1.6 | 53.3±3.1 | 49.9±4.3 |
| | 3200 | 96.0 | 84.0 | 76.0 | 76.0 | 52.0 | 48.0 | 68.0 | 64.0 | 64.0 |
The last table covers tree-search methods, where Budget denotes the total search budget.

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL + RMaxTS | 1×3200 | 60.0±0.0 | 3.0±1.7 | 22.0±2.0 | 24.0±0.0 | 8.0±2.8 | 13.0±3.3 | 44.0±0.0 | 14.0±3.5 | 29.0±1.7 |
| | 2×3200 | 60.0±0.0 | 6.0±2.0 | 26.0±2.0 | 24.0±0.0 | 10.0±2.0 | 16.0±0.0 | 44.0±0.0 | 16.0±4.0 | 32.0±0.0 |
| | 4×3200 | 60.0 | 8.0 | 28.0 | 24.0 | 12.0 | 20.0 | 44.0 | 20.0 | 36.0 |
| InternLM2.5-StepProver + BF | 1×32×600 | 30.8±3.1 | 0.0±0.0 | 2.5±3.1 | 12.0±0.0 | 0.0±0.0 | 1.2±1.9 | 34.0±2.0 | 2.2±2.0 | 17.0±3.9 |
| | 4×32×600 | 38.0±4.5 | 0.0±0.0 | 9.0±3.3 | 12.0±0.0 | 0.0±0.0 | 3.0±1.7 | 36.0±0.0 | 5.0±1.7 | 21.0±1.7 |
| | 16×32×600 | 44.0 | 0.0 | 24.0 | 12.0 | 0.0 | 4.0 | 36.0 | 8.0 | 24.0 |
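For orientation: a standard way to estimate pass@k from a larger pool of generations per problem is the unbiased estimator of Chen et al. (2021). Whether the numbers above were produced with exactly this estimator (rather than, say, repeated subsampling) is an assumption; the snippet below, with hypothetical counts, is only a sketch of the metric.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least
    one of k samples, drawn without replacement from n generations of
    which c were verified correct, is a valid proof."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 3200 generations for one problem, 480 verified
# by Lean; estimate pass@32 for that single problem.
print(f"pass@32 = {pass_at_k(3200, 480, 32):.3f}")
```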
If you find our work helpful, please consider starring ⭐ this repo and citing:


