We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers, including Goedel, STP, and Kimina-7B, struggle significantly. DeepSeek-Prover-V2 shows relative robustness (possibly because it is trained to decompose problems into sub-problems) but still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, which pins the weakness on compositional reasoning itself. Our results expose a persistent gap between the generalization behavior of current AI provers and human mathematical intuition.
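To make the transformations concrete, here is a minimal Lean 4 sketch (requiring Mathlib) of the variable-duplication idea. The theorem names and statements are illustrative stand-ins, not actual Ineq-Comp entries:

```lean
import Mathlib

-- Seed: a two-variable AM-GM-style inequality, easy for both
-- humans and provers.
theorem amgm_seed (a b : ℝ) (ha : 0 ≤ a) (hb : 0 ≤ b) :
    a * b ≤ ((a + b) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Variable duplication: restate the seed over a fresh copy of the
-- variables and combine the two instances. A human simply reuses the
-- seed argument twice; provers often treat this as a new problem.
theorem amgm_dup (a b c d : ℝ) (ha : 0 ≤ a) (hb : 0 ≤ b)
    (hc : 0 ≤ c) (hd : 0 ≤ d) :
    a * b + c * d ≤ ((a + b) / 2) ^ 2 + ((c + d) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a - b), sq_nonneg (c - d)]
```

A prover that has solved the seed has, in principle, everything it needs for the duplicated variant; the benchmark measures whether it can actually make that step.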
The Lean 4 environment and the corresponding Mathlib version used in this project follow DeepSeek-Prover-V1.5. Please first install the matching Lean 4 and Mathlib versions by following the environment setup guide.
After installing the Lean 4 environment, copy the benchmark and scripts_eval folders into the parent folder where you built Mathlib. You should end up with the following file structure (only the important folders are shown).
```
parent_folder/
├── benchmark/
├── configs/
├── mathlib4/
├── prover/
└── scripts_eval/
```
All numbers below are pass rates (%); paired entries are reported as mean±std. The first table covers the general-purpose reasoning model DeepSeek-R1-Distill-Qwen-32B with thinking mode disabled and enabled.

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B (w/o thinking) | 32 | 48.2±1.9 | 3.5±3.3 | 16.2±3.0 | 28.0±3.3 | 17.0±3.2 | 15.0±3.0 | 41.4±3.7 | 13.4±4.5 | 15.4±4.4 |
| | 64 | 49.0±1.7 | 6.5±4.1 | 18.4±2.4 | 30.6±3.2 | 19.5±2.8 | 16.8±2.7 | 44.5±3.2 | 17.7±4.0 | 20.2±4.8 |
| | 128 | 49.9±2.0 | 10.6±4.2 | 20.0±2.5 | 32.6±2.9 | 21.8±3.2 | 19.0±2.6 | 47.4±3.1 | 21.1±3.7 | 25.4±4.2 |
| | 3200 | 52.0 | 40.0 | 36.0 | 44.0 | 32.0 | 28.0 | 52.0 | 36.0 | 36.0 |
| DeepSeek-R1-Distill-Qwen-32B (w/ thinking) | 32 | 48.8±1.6 | 10.9±3.8 | 21.1±3.1 | 42.9±2.5 | 27.0±3.4 | 18.4±2.4 | 50.5±2.3 | 18.9±4.6 | 22.0±4.0 |
| | 64 | 49.5±1.9 | 14.5±4.4 | 23.0±3.4 | 44.5±2.4 | 30.3±2.9 | 20.6±2.3 | 51.9±0.6 | 23.7±4.9 | 26.2±3.1 |
| | 128 | 50.9±2.1 | 19.2±4.1 | 26.1±4.3 | 46.2±2.3 | 32.6±2.7 | 22.1±2.0 | 52.0±0.0 | 28.0±3.9 | 29.4±2.7 |
| | 3200 | 60.0 | 44.0 | 44.0 | 56.0 | 40.0 | 24.0 | 52.0 | 36.0 | 40.0 |
The next table covers dedicated whole-proof provers under the same protocol.

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL | 32 | 48.1±3.0 | 0.0±0.4 | 8.2±1.5 | 14.9±3.2 | 2.9±1.8 | 4.4±1.4 | 40.2±2.8 | 12.4±1.1 | 12.2±2.5 |
| | 64 | 50.6±2.9 | 0.1±0.6 | 9.0±1.7 | 17.0±2.7 | 3.7±1.1 | 5.0±1.9 | 42.1±2.3 | 12.7±1.7 | 13.8±2.9 |
| | 128 | 52.2±2.1 | 0.2±0.8 | 9.8±2.0 | 18.7±2.7 | 4.0±0.0 | 6.1±2.3 | 43.2±1.6 | 13.3±2.2 | 16.2±2.9 |
| | 3200 | 60.0 | 4.0 | 24.0 | 24.0 | 4.0 | 12.0 | 44.0 | 20.0 | 28.0 |
| Goedel-Prover-SFT | 32 | 48.6±2.9 | 0.4±1.2 | 14.0±3.2 | 34.8±2.5 | 12.4±3.5 | 21.5±3.4 | 47.0±1.7 | 14.4±3.1 | 24.6±1.9 |
| | 64 | 50.6±2.6 | 0.8±1.6 | 16.6±2.8 | 36.2±1.9 | 15.8±3.4 | 24.6±2.9 | 47.8±0.9 | 16.6±2.5 | 25.5±1.9 |
| | 128 | 52.2±1.4 | 1.3±1.9 | 18.6±2.2 | 37.1±1.8 | 19.4±2.9 | 26.9±1.8 | 48.0±0.0 | 17.9±2.6 | 26.4±2.5 |
| | 3200 | 60.0 | 4.0 | 24.0 | 40.0 | 32.0 | 28.0 | 48.0 | 24.0 | 36.0 |
| STP (w/o miniF2F valid) | 32 | 59.1±1.9 | 14.3±4.4 | 23.2±4.5 | 35.2±2.4 | 14.6±2.7 | 16.0±2.6 | 55.6±1.3 | 12.6±5.0 | 27.6±3.6 |
| | 64 | 60.1±0.6 | 18.5±4.1 | 28.2±4.6 | 36.8±2.4 | 16.7±2.8 | 17.3±2.7 | 56.0±0.0 | 17.8±4.9 | 31.0±4.1 |
| | 128 | 60.3±1.1 | 24.3±4.1 | 33.0±3.6 | 37.9±2.6 | 18.4±3.0 | 18.9±3.3 | 56.0±0.0 | 24.0±4.4 | 33.9±4.1 |
| | 3200 | 64.0 | 44.0 | 40.0 | 44.0 | 24.0 | 28.0 | 56.0 | 36.0 | 40.0 |
| Kimina-Prover-Preview-Distill-7B | 32 | 59.4±4.1 | 11.7±5.4 | 45.2±3.7 | 46.9±4.5 | 27.0±2.6 | 27.7±3.3 | 44.2±1.3 | 18.1±3.9 | 35.8±2.0 |
| | 64 | 64.1±4.6 | 19.4±5.9 | 48.6±2.4 | 52.7±4.3 | 28.8±2.5 | 30.2±2.8 | 44.6±1.4 | 22.3±2.9 | 36.8±2.0 |
| | 128 | 69.4±4.2 | 28.2±5.4 | 50.6±2.2 | 57.6±3.6 | 30.4±3.0 | 32.0±1.6 | 45.1±1.8 | 25.6±2.5 | 37.6±2.5 |
| | 3200 | 80.0 | 44.0 | 64.0 | 68.0 | 52.0 | 36.0 | 52.0 | 32.0 | 44.0 |
| DeepSeek-Prover-V2-7B | 32 | 75.0±4.4 | 58.6±4.0 | 52.5±4.5 | 64.6±4.1 | 33.0±2.3 | 35.0±2.3 | 59.1±2.9 | 49.3±3.4 | 38.8±4.4 |
| | 64 | 80.7±5.3 | 62.1±4.5 | 57.4±4.0 | 68.3±3.1 | 34.7±2.7 | 36.6±2.3 | 61.7±2.5 | 51.6±2.9 | 43.7±4.2 |
| | 128 | 85.8±5.4 | 65.9±5.3 | 61.4±3.7 | 71.0±2.0 | 36.3±3.6 | 37.9±2.6 | 64.0±1.6 | 53.3±3.1 | 49.9±4.3 |
| | 3200 | 96.0 | 84.0 | 76.0 | 76.0 | 52.0 | 48.0 | 68.0 | 64.0 | 64.0 |
The last table covers tree-search methods, where Budget denotes the total search budget.

| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL + RMaxTS | 1×3200 | 60.0±0.0 | 3.0±1.7 | 22.0±2.0 | 24.0±0.0 | 8.0±2.8 | 13.0±3.3 | 44.0±0.0 | 14.0±3.5 | 29.0±1.7 |
| | 2×3200 | 60.0±0.0 | 6.0±2.0 | 26.0±2.0 | 24.0±0.0 | 10.0±2.0 | 16.0±0.0 | 44.0±0.0 | 16.0±4.0 | 32.0±0.0 |
| | 4×3200 | 60.0 | 8.0 | 28.0 | 24.0 | 12.0 | 20.0 | 44.0 | 20.0 | 36.0 |
| InternLM2.5-StepProver + BF | 1×32×600 | 30.8±3.1 | 0.0±0.0 | 2.5±3.1 | 12.0±0.0 | 0.0±0.0 | 1.2±1.9 | 34.0±2.0 | 2.2±2.0 | 17.0±3.9 |
| | 4×32×600 | 38.0±4.5 | 0.0±0.0 | 9.0±3.3 | 12.0±0.0 | 0.0±0.0 | 3.0±1.7 | 36.0±0.0 | 5.0±1.7 | 21.0±1.7 |
| | 16×32×600 | 44.0 | 0.0 | 24.0 | 12.0 | 0.0 | 4.0 | 36.0 | 8.0 | 24.0 |
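For orientation: a standard way to estimate pass@k from a larger pool of generations per problem is the unbiased estimator of Chen et al. (2021). Whether the numbers above were produced with exactly this estimator (rather than, say, repeated subsampling) is an assumption; the snippet below, with hypothetical counts, is only a sketch of the metric.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least
    one of k samples, drawn without replacement from n generations of
    which c were verified correct, is a valid proof."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 3200 generations for one problem, 480 verified
# by Lean; estimate pass@32 for that single problem.
print(f"pass@32 = {pass_at_k(3200, 480, 32):.3f}")
```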
If you find our work helpful, please consider starring ⭐ this repo and citing:


