We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers—including Goedel, STP, and Kimina-7B—struggle significantly. DeepSeek-Prover-V2 shows relative robustness—possibly because it is trained to decompose the problems into sub-problems—but still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persistent gap between the generalization behavior of current AI provers and human mathematical intuition.
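To make these transformations concrete, here is a minimal Lean 4 sketch in the spirit of the benchmark (the statements and names are illustrative, not verbatim benchmark items): a seed problem in polynomial AM-GM form, and a variable-duplicated variant obtained by substituting `a ↦ a * b`, `b ↦ c * d`.

```lean
import Mathlib

-- Illustrative seed problem: two-variable AM-GM in polynomial form.
theorem amgm_seed_sketch (a b : ℝ) : a * b ≤ ((a + b) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Variable duplication (a ↦ a * b, b ↦ c * d): mathematically the same
-- argument, yet provers that pattern-match on the seed often fail here.
theorem amgm_dup_sketch (a b c d : ℝ) :
    a * b * (c * d) ≤ ((a * b + c * d) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a * b - c * d)]
```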
The Lean 4 environment and the corresponding Mathlib version used in this project follow DeepSeek-Prover-V1.5. Please first install the correct Lean 4 and Mathlib versions following the environment setup guide.
After installing the corresponding Lean 4 environment, please copy the benchmark/
and scripts_eval/
folder to the parent folder where you build your Mathlib. You should get the following file strueture (only show the important folders).
```
parent_folder/
├── benchmark/
├── configs/
├── mathlib4/
├── prover/
└── scripts_eval/
```
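The copy step amounts to something like the following (the destination path is a placeholder for your own setup):

```bash
# Copy the benchmark data and evaluation scripts next to your Mathlib build
# (replace /path/to/parent_folder with your actual location).
cp -r benchmark/ scripts_eval/ /path/to/parent_folder/
```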
Please run the following command to test DeepSeek-R1-Distill-Qwen-32B without the thinking block (chat template) under pass@32. This will test the model on the 25 seed problems from Ineq-AMGM:
```bash
bash scripts_eval/inference_2gpu.sh -i benchmark/amgm_seed.jsonl -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B -o results/amgm_seed_r1-distill-qwen-32b_nothink -n 32
```
The script will: (1) run inference with vLLM and extract the Lean code (requires 2 GPUs); (2) submit the code to the REPL to be verified by the Lean 4 compiler (no GPU needed).
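As background, the pass@$k$ numbers reported below can be read as the standard unbiased estimator (Chen et al., 2021), computed from $n \ge k$ generated samples per problem of which $s$ are verified by the Lean 4 compiler; the exact computation lives in the evaluation scripts:

$$\text{pass@}k \;=\; \mathop{\mathbb{E}}_{\text{problems}}\left[ 1 - \frac{\binom{n-s}{k}}{\binom{n}{k}} \right]$$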
For DeepSeek-R1-Distill-Qwen-32B with the thinking block, please run:
```bash
bash scripts_eval/inference_think_2gpu.sh -i benchmark/amgm_seed.jsonl -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B -o results/amgm_seed_r1-distill-qwen-32b -n 32
```
To test DeepSeek-Prover-V1.5-RL, Goedel-Prover-SFT, or STP, please use the script for DeepSeek-R1-Distill-Qwen-32B without the thinking block, changing the model to your target model (taking STP as an example):
```bash
bash scripts_eval/inference_2gpu.sh -i benchmark/amgm_seed.jsonl -m kfdong/STP_model_Lean -o results/amgm_seed_stp -n 32
```
For Kimina-Prover-Preview-Distill-7B, please run the following script:
```bash
bash scripts_eval/inference_kimina_2gpu.sh -i benchmark/amgm_seed.jsonl -m AI-MO/Kimina-Prover-Preview-Distill-7B -o results/amgm_seed_kimina-7b -n 32
```
For DeepSeek-Prover-V2-7B, please run the following script:
```bash
bash scripts_eval/inference_dsprover2_2gpu.sh -i benchmark/amgm_seed.jsonl -m deepseek-ai/DeepSeek-Prover-V2-7B -o results/amgm_seed_dsprover2-7b -n 32
```
The main logic of all these scripts is very similar; the differences between the scripts (for DeepSeek-R1-Distill-Qwen-32B, Kimina-Prover-Preview-Distill-7B, and DeepSeek-Prover-V2-7B) lie in the prompt and the response template.
Note 1: all the scripts can technically be run on 2 H100 80GB GPUs. However, we recommend using 4 H100 80GB GPUs when testing DeepSeek-R1-Distill-Qwen-32B, Kimina-Prover-Preview-Distill-7B, and DeepSeek-Prover-V2-7B, especially if you want to test with a generation length above 16K, since some vLLM versions may hit GPU OOM with only 2 GPUs.
Note 2: we highly recommend splitting the job into smaller ones, especially when testing DeepSeek-R1-Distill-Qwen-32B, Kimina-Prover-Preview-Distill-7B, or DeepSeek-Prover-V2-7B, or when testing models under a high budget (pass@3200). We include a SLURM header in each script for better parallelization with more GPU resources; please refer to the scripts for more details, and see the sketch below.
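As a rough illustration (a sketch, not one of the provided scripts: the SLURM flags and the shard-naming convention here are assumptions), a pass@3200 run can be split into 100 pass@32 shards via a job array, aggregating the shard outputs afterwards:

```bash
#!/bin/bash
#SBATCH --array=0-99   # 100 shards x 32 samples = 3200 samples per problem
#SBATCH --gres=gpu:2   # adjust to your cluster; see the headers in scripts_eval/

# Each array task generates 32 samples; merge the shard results afterwards
# to obtain pass@3200.
bash scripts_eval/inference_dsprover2_2gpu.sh \
  -i benchmark/amgm_seed.jsonl \
  -m deepseek-ai/DeepSeek-Prover-V2-7B \
  -o results/amgm_seed_dsprover2-7b_shard${SLURM_ARRAY_TASK_ID} \
  -n 32
```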
The experiments for DeepSeek-Prover-V1.5-RL + RMaxTS can be reproduced using exactly the same command as in the original DeepSeek-Prover-V1.5 repo, after changing the dataset to the benchmark dataset (`benchmark/amgm_seed.jsonl`) in the `configs/RMaxTS.py` file:
```bash
python -m prover.launch --config=configs/RMaxTS.py --log_dir=logs/RMaxTS_results
```
Please refer to the DeepSeek-Prover-V1.5 codebase for more details.
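Concretely, the edit amounts to pointing the dataset path in `configs/RMaxTS.py` at the benchmark file, roughly as follows (the exact variable name is an assumption; check the upstream config):

```python
# configs/RMaxTS.py (sketch): point the proof search at the Ineq-Comp benchmark.
# NOTE: the actual field name may differ in the upstream DeepSeek-Prover-V1.5 repo.
data_path = 'benchmark/amgm_seed.jsonl'
```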
The evaluation code is based on the InternLM GitHub repo, in particular the MiniF2F evaluation code, with only minimal modifications. Please follow that repo to install the correct Lean 4 version and other corresponding package versions, especially LeanDojo.
After installing the corresponding environment, please substitute MiniF2F with LeanIneqComp-Dojo, the GitHub repo of Ineq-Comp that is traceable by LeanDojo and adapted to the Lean 4 version used by InternLM2.5-StepProver.
All the data of our Ineq-Comp benchmark, including the proofs of the seed problems, can be found in the `benchmark/` folder, and is also available on Hugging Face.
For Ineq-Simp, which contains Ineq-AMGM, Ineq-Cauchy, and Ineq-MISC, the questions can be found at `benchmark/amgm_*.jsonl`, `benchmark/cauchy_*.jsonl`, and `benchmark/misc_*.jsonl`.
`benchmark/comp2_100.jsonl` contains 100 inequality problems generated using Ineq-Mix by randomly composing two seed problems from Ineq-Simp. `benchmark/real.jsonl` contains 50 real-world inequality problems.
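A quick sanity check after downloading the data (a sketch; assumes `jq` is installed):

```bash
# Count problems and inspect the fields of the first entry.
wc -l benchmark/amgm_seed.jsonl benchmark/comp2_100.jsonl benchmark/real.jsonl
head -n 1 benchmark/amgm_seed.jsonl | jq 'keys'
```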
The proofs for the 75 seed problems can be found in the `benchmark/full_answer/` folder. We include proofs for two Lean 4 versions: 4.9, on which our main experiments are based (also the version used by DeepSeek-Prover-V1.5), and 4.18, the stable Lean 4 version used by the interactive Lean 4 web editor at the time this benchmark was curated.
Our Ineq-Mix framework, which can generate more problems given a pool of base problems, can be found in the `composition/` folder. `original_problems.jsonl` includes 65 seed problems that can be composed or transformed. The composition rules, the variable-level algebraic transformation rules, and the problem-level algebraic transformation rules are defined in `composition/comp_op.py`, `composition/algebraic_op.py`, and `composition/algebraic_whole_op.py`, respectively. Please refer to `composition/mix.py` for more details; a toy example of what a composed problem can look like is sketched below.
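To give the flavor of what Ineq-Mix produces, here is a hedged Lean 4 sketch of a problem-level composition of two AM-GM-style seeds (illustrative only, not an actual `comp2_100.jsonl` item): the composed statement follows by combining the two constituent inequalities.

```lean
import Mathlib

-- Two seed-style facts composed into one statement (illustrative).
-- Each summand is an instance of polynomial AM-GM, so the composed goal
-- requires proving both constituent inequalities.
theorem comp_sketch (a b c d : ℝ) :
    a * b + c * d ≤ ((a + b) / 2) ^ 2 + ((c + d) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a - b), sq_nonneg (c - d)]
```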
| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B (w/o thinking) | 32 | 48.2±1.9 | 3.5±3.3 | 16.2±3.0 | 28.0±3.3 | 17.0±3.2 | 15.0±3.0 | 41.4±3.7 | 13.4±4.5 | 15.4±4.4 |
| | 64 | 49.0±1.7 | 6.5±4.1 | 18.4±2.4 | 30.6±3.2 | 19.5±2.8 | 16.8±2.7 | 44.5±3.2 | 17.7±4.0 | 20.2±4.8 |
| | 128 | 49.9±2.0 | 10.6±4.2 | 20.0±2.5 | 32.6±2.9 | 21.8±3.2 | 19.0±2.6 | 47.4±3.1 | 21.1±3.7 | 25.4±4.2 |
| | 3200 | 52.0 | 40.0 | 36.0 | 44.0 | 32.0 | 28.0 | 52.0 | 36.0 | 36.0 |
| DeepSeek-R1-Distill-Qwen-32B (w/ thinking) | 32 | 48.8±1.6 | 10.9±3.8 | 21.1±3.1 | 42.9±2.5 | 27.0±3.4 | 18.4±2.4 | 50.5±2.3 | 18.9±4.6 | 22.0±4.0 |
| | 64 | 49.5±1.9 | 14.5±4.4 | 23.0±3.4 | 44.5±2.4 | 30.3±2.9 | 20.6±2.3 | 51.9±0.6 | 23.7±4.9 | 26.2±3.1 |
| | 128 | 50.9±2.1 | 19.2±4.1 | 26.1±4.3 | 46.2±2.3 | 32.6±2.7 | 22.1±2.0 | 52.0±0.0 | 28.0±3.9 | 29.4±2.7 |
| | 3200 | 60.0 | 44.0 | 44.0 | 56.0 | 40.0 | 24.0 | 52.0 | 36.0 | 40.0 |
| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL | 32 | 48.1±3.0 | 0.0±0.4 | 8.2±1.5 | 14.9±3.2 | 2.9±1.8 | 4.4±1.4 | 40.2±2.8 | 12.4±1.1 | 12.2±2.5 |
| | 64 | 50.6±2.9 | 0.1±0.6 | 9.0±1.7 | 17.0±2.7 | 3.7±1.1 | 5.0±1.9 | 42.1±2.3 | 12.7±1.7 | 13.8±2.9 |
| | 128 | 52.2±2.1 | 0.2±0.8 | 9.8±2.0 | 18.7±2.7 | 4.0±0.0 | 6.1±2.3 | 43.2±1.6 | 13.3±2.2 | 16.2±2.9 |
| | 3200 | 60.0 | 4.0 | 24.0 | 24.0 | 4.0 | 12.0 | 44.0 | 20.0 | 28.0 |
| Goedel-Prover-SFT | 32 | 48.6±2.9 | 0.4±1.2 | 14.0±3.2 | 34.8±2.5 | 12.4±3.5 | 21.5±3.4 | 47.0±1.7 | 14.4±3.1 | 24.6±1.9 |
| | 64 | 50.6±2.6 | 0.8±1.6 | 16.6±2.8 | 36.2±1.9 | 15.8±3.4 | 24.6±2.9 | 47.8±0.9 | 16.6±2.5 | 25.5±1.9 |
| | 128 | 52.2±1.4 | 1.3±1.9 | 18.6±2.2 | 37.1±1.8 | 19.4±2.9 | 26.9±1.8 | 48.0±0.0 | 17.9±2.6 | 26.4±2.5 |
| | 3200 | 60.0 | 4.0 | 24.0 | 40.0 | 32.0 | 28.0 | 48.0 | 24.0 | 36.0 |
| STP (w/o miniF2F valid) | 32 | 59.1±1.9 | 14.3±4.4 | 23.2±4.5 | 35.2±2.4 | 14.6±2.7 | 16.0±2.6 | 55.6±1.3 | 12.6±5.0 | 27.6±3.6 |
| | 64 | 60.1±0.6 | 18.5±4.1 | 28.2±4.6 | 36.8±2.4 | 16.7±2.8 | 17.3±2.7 | 56.0±0.0 | 17.8±4.9 | 31.0±4.1 |
| | 128 | 60.3±1.1 | 24.3±4.1 | 33.0±3.6 | 37.9±2.6 | 18.4±3.0 | 18.9±3.3 | 56.0±0.0 | 24.0±4.4 | 33.9±4.1 |
| | 3200 | 64.0 | 44.0 | 40.0 | 44.0 | 24.0 | 28.0 | 56.0 | 36.0 | 40.0 |
| Kimina-Prover-Preview-Distill-7B | 32 | 59.4±4.1 | 11.7±5.4 | 45.2±3.7 | 46.9±4.5 | 27.0±2.6 | 27.7±3.3 | 44.2±1.3 | 18.1±3.9 | 35.8±2.0 |
| | 64 | 64.1±4.6 | 19.4±5.9 | 48.6±2.4 | 52.7±4.3 | 28.8±2.5 | 30.2±2.8 | 44.6±1.4 | 22.3±2.9 | 36.8±2.0 |
| | 128 | 69.4±4.2 | 28.2±5.4 | 50.6±2.2 | 57.6±3.6 | 30.4±3.0 | 32.0±1.6 | 45.1±1.8 | 25.6±2.5 | 37.6±2.5 |
| | 3200 | 80.0 | 44.0 | 64.0 | 68.0 | 52.0 | 36.0 | 52.0 | 32.0 | 44.0 |
| DeepSeek-Prover-V2-7B | 32 | 75.0±4.4 | 58.6±4.0 | 52.5±4.5 | 64.6±4.1 | 33.0±2.3 | 35.0±2.3 | 59.1±2.9 | 49.3±3.4 | 38.8±4.4 |
| | 64 | 80.7±5.3 | 62.1±4.5 | 57.4±4.0 | 68.3±3.1 | 34.7±2.7 | 36.6±2.3 | 61.7±2.5 | 51.6±2.9 | 43.7±4.2 |
| | 128 | 85.8±5.4 | 65.9±5.3 | 61.4±3.7 | 71.0±2.0 | 36.3±3.6 | 37.9±2.6 | 64.0±1.6 | 53.3±3.1 | 49.9±4.3 |
| | 3200 | 96.0 | 84.0 | 76.0 | 76.0 | 52.0 | 48.0 | 68.0 | 64.0 | 64.0 |
| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL + RMaxTS | 1×3200 | 60.0±0.0 | 3.0±1.7 | 22.0±2.0 | 24.0±0.0 | 8.0±2.8 | 13.0±3.3 | 44.0±0.0 | 14.0±3.5 | 29.0±1.7 |
| | 2×3200 | 60.0±0.0 | 6.0±2.0 | 26.0±2.0 | 24.0±0.0 | 10.0±2.0 | 16.0±0.0 | 44.0±0.0 | 16.0±4.0 | 32.0±0.0 |
| | 4×3200 | 60.0 | 8.0 | 28.0 | 24.0 | 12.0 | 20.0 | 44.0 | 20.0 | 36.0 |
| InternLM2.5-StepProver + BF | 1×32×600 | 30.8±3.1 | 0.0±0.0 | 2.5±3.1 | 12.0±0.0 | 0.0±0.0 | 1.2±1.9 | 34.0±2.0 | 2.2±2.0 | 17.0±3.9 |
| | 4×32×600 | 38.0±4.5 | 0.0±0.0 | 9.0±3.3 | 12.0±0.0 | 0.0±0.0 | 3.0±1.7 | 36.0±0.0 | 5.0±1.7 | 21.0±1.7 |
| | 16×32×600 | 44.0 | 0.0 | 24.0 | 12.0 | 0.0 | 4.0 | 36.0 | 8.0 | 24.0 |
If you find our work helpful, please consider starring ⭐ us and citing:
```bibtex
@article{zhao2025ineq,
  title={Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities},
  author={Zhao, Haoyu and Geng, Yihan and Tang, Shange and Lin, Yong and Lyu, Bohan and Lin, Hongzhou and Jin, Chi and Arora, Sanjeev},
  journal={arXiv preprint arXiv:2505.12680},
  year={2025}
}
```