Skip to content

haoyuzhao123/LeanIneqComp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

📊 Performance of Different Theorem Provers

General-Purpose Models

Method Budget AM-GM Seed AM-GM I AM-GM II Cauchy Seed Cauchy I Cauchy II Misc Seed Misc I Misc II
DeepSeek-R1-Distill-Qwen-32B (w/o thinking) 32 48.21.9 3.53.3 16.23.0 28.03.3 17.03.2 15.03.0 41.43.7 13.44.5 15.44.4
64 49.01.7 6.54.1 18.42.4 30.63.2 19.52.8 16.82.7 44.53.2 17.74.0 20.24.8
128 49.92.0 10.64.2 20.02.5 32.62.9 21.83.2 19.02.6 47.43.1 21.13.7 25.44.2
3200 52.0 40.0 36.0 44.0 32.0 28.0 52.0 36.0 36.0
DeepSeek-R1-Distill-Qwen-32B (w thinking) 32 48.81.6 10.93.8 21.13.1 42.92.5 27.03.4 18.42.4 50.52.3 18.94.6 22.04.0
64 49.51.9 14.54.4 23.03.4 44.52.4 30.32.9 20.62.3 51.90.6 23.74.9 26.23.1
128 50.92.1 19.24.1 26.14.3 46.22.3 32.62.7 22.12.0 52.00.0 28.03.9 29.42.7
3200 60.0 44.0 44.0 56.0 40.0 24.0 52.0 36.0 40.0

Whole-Proof Generation Methods

Method Budget AM-GM Seed AM-GM I AM-GM II Cauchy Seed Cauchy I Cauchy II Misc Seed Misc I Misc II
DeepSeek-Prover-V1.5-RL 32 48.13.0 0.00.4 8.21.5 14.93.2 2.91.8 4.41.4 40.22.8 12.41.1 12.22.5
64 50.62.9 0.10.6 9.01.7 17.02.7 3.71.1 5.01.9 42.12.3 12.71.7 13.82.9
128 52.22.1 0.20.8 9.82.0 18.72.7 4.00.0 6.12.3 43.21.6 13.32.2 16.22.9
3200 60.0 4.0 24.0 24.0 4.0 12.0 44.0 20.0 28.0
Goedel-Prover-SFT 32 48.62.9 0.41.2 14.03.2 34.82.5 12.43.5 21.53.4 47.01.7 14.43.1 24.61.9
64 50.62.6 0.81.6 16.62.8 36.21.9 15.83.4 24.62.9 47.80.9 16.62.5 25.51.9
128 52.21.4 1.31.9 18.62.2 37.11.8 19.42.9 26.91.8 48.00.0 17.92.6 26.42.5
3200 60.0 4.0 24.0 40.0 32.0 28.0 48.0 24.0 36.0
STP (w/o miniF2F valid) 32 59.11.9 14.34.4 23.24.5 35.22.4 14.62.7 16.02.6 55.61.3 12.65.0 27.63.6
64 60.10.6 18.54.1 28.24.6 36.82.4 16.72.8 17.32.7 56.00.0 17.84.9 31.04.1
128 60.31.1 24.34.1 33.03.6 37.92.6 18.43.0 18.93.3 56.00.0 24.04.4 33.94.1
3200 64.0 44.0 40.0 44.0 24.0 28.0 56.0 36.0 40.0
Kimina-Prover-Preview-Distill-7B 32 59.44.1 11.75.4 45.23.7 46.94.5 27.02.6 27.73.3 44.21.3 18.13.9 35.82.0
64 64.14.6 19.45.9 48.62.4 52.74.3 28.82.5 30.22.8 44.61.4 22.32.9 36.82.0
128 69.44.2 28.25.4 50.62.2 57.63.6 30.43.0 32.01.6 45.11.8 25.62.5 37.62.5
3200 80.0 44.0 64.0 68.0 52.0 36.0 52.0 32.0 44.0
DeepSeek-Prover-V2-7B 32 75.04.4 58.64.0 52.54.5 64.64.1 33.02.3 35.02.3 59.12.9 49.33.4 38.84.4
64 80.75.3 62.14.5 57.44.0 68.33.1 34.72.7 36.62.3 61.72.5 51.62.9 43.74.2
128 85.85.4 65.95.3 61.43.7 71.02.0 36.33.6 37.92.6 64.01.6 53.33.1 49.94.3
3200 96.0 84.0 76.0 76.0 52.0 48.0 68.0 64.0 64.0

Tree-Search Methods

Method Budget AM-GM Seed AM-GM I AM-GM II Cauchy Seed Cauchy I Cauchy II Misc Seed Misc I Misc II
DeepSeek-Prover-V1.5-RL + RMaxTS 1×3200 60.00.0 3.01.7 22.02.0 24.00.0 8.02.8 13.03.3 44.00.0 14.03.5 29.01.7
2×3200 60.00.0 6.02.0 26.02.0 24.00.0 10.02.0 16.00.0 44.00.0 16.04.0 32.00.0
4×3200 60.0 8.0 28.0 24.0 12.0 20.0 44.0 20.0 36.0
InternLM2.5-StepProver + BF 1×32×600 30.83.1 0.00.0 2.53.1 12.00.0 0.00.0 1.21.9 34.02.0 2.22.0 17.03.9
4×32×600 38.04.5 0.00.0 9.03.3 12.00.0 0.00.0 3.01.7 36.00.0 5.01.7 21.01.7
16×32×600 44.0 0.0 24.0 12.0 0.0 4.0 36.0 8.0 24.0

📖 Citation

If you find our work helps, please consider starring ⭐ us and citing:


About

An inequality benchmark for theorem proving

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •