Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities
| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B (w/o thinking) | 32 | 48.21.9 | 3.53.3 | 16.23.0 | 28.03.3 | 17.03.2 | 15.03.0 | 41.43.7 | 13.44.5 | 15.44.4 |
| 64 | 49.01.7 | 6.54.1 | 18.42.4 | 30.63.2 | 19.52.8 | 16.82.7 | 44.53.2 | 17.74.0 | 20.24.8 | |
| 128 | 49.92.0 | 10.64.2 | 20.02.5 | 32.62.9 | 21.83.2 | 19.02.6 | 47.43.1 | 21.13.7 | 25.44.2 | |
| 3200 | 52.0 | 40.0 | 36.0 | 44.0 | 32.0 | 28.0 | 52.0 | 36.0 | 36.0 | |
| DeepSeek-R1-Distill-Qwen-32B (w thinking) | 32 | 48.81.6 | 10.93.8 | 21.13.1 | 42.92.5 | 27.03.4 | 18.42.4 | 50.52.3 | 18.94.6 | 22.04.0 |
| 64 | 49.51.9 | 14.54.4 | 23.03.4 | 44.52.4 | 30.32.9 | 20.62.3 | 51.90.6 | 23.74.9 | 26.23.1 | |
| 128 | 50.92.1 | 19.24.1 | 26.14.3 | 46.22.3 | 32.62.7 | 22.12.0 | 52.00.0 | 28.03.9 | 29.42.7 | |
| 3200 | 60.0 | 44.0 | 44.0 | 56.0 | 40.0 | 24.0 | 52.0 | 36.0 | 40.0 |
| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL | 32 | 48.13.0 | 0.00.4 | 8.21.5 | 14.93.2 | 2.91.8 | 4.41.4 | 40.22.8 | 12.41.1 | 12.22.5 |
| 64 | 50.62.9 | 0.10.6 | 9.01.7 | 17.02.7 | 3.71.1 | 5.01.9 | 42.12.3 | 12.71.7 | 13.82.9 | |
| 128 | 52.22.1 | 0.20.8 | 9.82.0 | 18.72.7 | 4.00.0 | 6.12.3 | 43.21.6 | 13.32.2 | 16.22.9 | |
| 3200 | 60.0 | 4.0 | 24.0 | 24.0 | 4.0 | 12.0 | 44.0 | 20.0 | 28.0 | |
| Goedel-Prover-SFT | 32 | 48.62.9 | 0.41.2 | 14.03.2 | 34.82.5 | 12.43.5 | 21.53.4 | 47.01.7 | 14.43.1 | 24.61.9 |
| 64 | 50.62.6 | 0.81.6 | 16.62.8 | 36.21.9 | 15.83.4 | 24.62.9 | 47.80.9 | 16.62.5 | 25.51.9 | |
| 128 | 52.21.4 | 1.31.9 | 18.62.2 | 37.11.8 | 19.42.9 | 26.91.8 | 48.00.0 | 17.92.6 | 26.42.5 | |
| 3200 | 60.0 | 4.0 | 24.0 | 40.0 | 32.0 | 28.0 | 48.0 | 24.0 | 36.0 | |
| STP (w/o miniF2F valid) | 32 | 59.11.9 | 14.34.4 | 23.24.5 | 35.22.4 | 14.62.7 | 16.02.6 | 55.61.3 | 12.65.0 | 27.63.6 |
| 64 | 60.10.6 | 18.54.1 | 28.24.6 | 36.82.4 | 16.72.8 | 17.32.7 | 56.00.0 | 17.84.9 | 31.04.1 | |
| 128 | 60.31.1 | 24.34.1 | 33.03.6 | 37.92.6 | 18.43.0 | 18.93.3 | 56.00.0 | 24.04.4 | 33.94.1 | |
| 3200 | 64.0 | 44.0 | 40.0 | 44.0 | 24.0 | 28.0 | 56.0 | 36.0 | 40.0 | |
| Kimina-Prover-Preview-Distill-7B | 32 | 59.44.1 | 11.75.4 | 45.23.7 | 46.94.5 | 27.02.6 | 27.73.3 | 44.21.3 | 18.13.9 | 35.82.0 |
| 64 | 64.14.6 | 19.45.9 | 48.62.4 | 52.74.3 | 28.82.5 | 30.22.8 | 44.61.4 | 22.32.9 | 36.82.0 | |
| 128 | 69.44.2 | 28.25.4 | 50.62.2 | 57.63.6 | 30.43.0 | 32.01.6 | 45.11.8 | 25.62.5 | 37.62.5 | |
| 3200 | 80.0 | 44.0 | 64.0 | 68.0 | 52.0 | 36.0 | 52.0 | 32.0 | 44.0 | |
| DeepSeek-Prover-V2-7B | 32 | 75.04.4 | 58.64.0 | 52.54.5 | 64.64.1 | 33.02.3 | 35.02.3 | 59.12.9 | 49.33.4 | 38.84.4 |
| 64 | 80.75.3 | 62.14.5 | 57.44.0 | 68.33.1 | 34.72.7 | 36.62.3 | 61.72.5 | 51.62.9 | 43.74.2 | |
| 128 | 85.85.4 | 65.95.3 | 61.43.7 | 71.02.0 | 36.33.6 | 37.92.6 | 64.01.6 | 53.33.1 | 49.94.3 | |
| 3200 | 96.0 | 84.0 | 76.0 | 76.0 | 52.0 | 48.0 | 68.0 | 64.0 | 64.0 |
| Method | Budget | AM-GM Seed | AM-GM I | AM-GM II | Cauchy Seed | Cauchy I | Cauchy II | Misc Seed | Misc I | Misc II |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1.5-RL + RMaxTS | 1×3200 | 60.00.0 | 3.01.7 | 22.02.0 | 24.00.0 | 8.02.8 | 13.03.3 | 44.00.0 | 14.03.5 | 29.01.7 |
| 2×3200 | 60.00.0 | 6.02.0 | 26.02.0 | 24.00.0 | 10.02.0 | 16.00.0 | 44.00.0 | 16.04.0 | 32.00.0 | |
| 4×3200 | 60.0 | 8.0 | 28.0 | 24.0 | 12.0 | 20.0 | 44.0 | 20.0 | 36.0 | |
| InternLM2.5-StepProver + BF | 1×32×600 | 30.83.1 | 0.00.0 | 2.53.1 | 12.00.0 | 0.00.0 | 1.21.9 | 34.02.0 | 2.22.0 | 17.03.9 |
| 4×32×600 | 38.04.5 | 0.00.0 | 9.03.3 | 12.00.0 | 0.00.0 | 3.01.7 | 36.00.0 | 5.01.7 | 21.01.7 | |
| 16×32×600 | 44.0 | 0.0 | 24.0 | 12.0 | 0.0 | 4.0 | 36.0 | 8.0 | 24.0 |
If you find our work helps, please consider starring ⭐ us and citing: