π EnumerateβConjectureβProve: Formally Solving Answer-Construction Problems in Math Competitions
- Background: We identify that current mathematical reasoning approaches either generate creative answers (LLMs) but fail to verify them formally, or verify rigorously (symbolic provers) but cannot efficiently generate answers.
- Contribution:
- We introduce ECP framework: a modular neuro-symbolic pipeline that integrates a feedback-driven autoformalization stage, LLM-based enumeration and conjecturing, and theorem-proving in Lean.
- We introduce ConstructiveBench: an autoformalized dataset of 3,431 competition-grade answer-construction problems, each includes informal and formal problem statements, ground-truth answers and metadata, enabling end-to-end formal verification (answer-construction + theorem-proving).
- Location:
data/dataset/constructivebench.json
- Description: Curated Olympiad-style problems with metadata and aligned Lean formalizations. Each entry includes:
- Problem statement
- Category (e.g., Algebra, Combinatorics)
- Formal answer in Lean
- Full formal theorem
- Answer-construction alignment parts (header, answer, theorem with/without answer)
{
"name": "IMO2011SLC4",
"category": "Combinatorics",
"source": "IMO/2011",
"problem": "...",
"answer": "The greatest such number k is 3",
"formalization": "...",
"...": "..."
}
Below, we show the problem sources and problem domains in ConstructiveBench.
- Location:
data/dataset/putnam.json
- Description: A selected subset of answer-construction problems from the PutnamBench dataset, chosen to evaluate the ECP pipeline on university-level problems.
We show the end-to-end answer-construction ("Conjecturer Model") and theorem proving ("Prover Model") accuracy with Pass@32 metric on 3,431 ConstructiveBench problems, comparing the CoT baseline vs. ECP for answer-construction.
- Python 3.10
- CUDA >= 11.8
- Git LFS
- Lean Proof Assistant
- (Optional) Chrome and Chrome Driver
-
Initialize Git LFS
(Skip if already set up.)git lfs install
-
Install Lean
(Skip if already set up.)curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh source ~/.bashrc
Verify installation:
lean --version
-
Clone the Repository
git clone --recursive https://github.com/JackSun200312/ECP cd ECP
-
Create a Python Virtual Environment
We recommend usingvenv
:python -m venv imosolver source imosolver/bin/activate pip install -r requirements.txt
-
Build Lean Environment
Build for both the newest version andv4.9.0-rc1
(for prover models). This may take around 30 minutes.cd Formalization lake update lake build Main cd .. cd prover/mathlib4 lake build cd ../..
-
Set Up LLM API Keys
Either add them to your shell file or editappl.yaml
directly:echo 'export OPENAI_API_KEY="your_openai_key_here"' >> ~/.bashrc echo 'export DEEPSEEK_API_KEY="your_deepseek_key_here"' >> ~/.bashrc source ~/.bashrc
ECP/
βββ prover/ # Adapted from Goedel-Prover, contains utilities for running prover models
βββ src/
β βββ ecp/
β βββ agent.py # Multi-agent framework for Enumerate β Conjecture β Prove (ECP)
β βββ main.py # Main entry point for experiments
β βββ utils.py # Utility functions
β βββ scripts/
β βββ dataset/
β βββ deploy/
β βββ finetune/
β βββ lean/
β βββ trace/
βββ data/
βββ dataset/
βββ constructivebench.json # ConstructiveBench dataset
βββ putnam.json # PutnamBench answer-construction subset
The file src/ecp/main.py
provides a unified interface for three pipelines:
answer_gen
: Full ECP pipeline (Enumerate β Conjecture β Verify)autoformalize
: Generate Lean formalizations from informal problems/answersproof_gen
: Use a formal prover to generate complete Lean proofs
- Use
--problem_path
to specify the dataset.- Main option:
constructivebench
(recommended) - For testing:
test
(runs a single case)
- Main option:
- Set
--mode
to one of:answer_gen
autoformalize
proof_gen
--enable_enumerator
:True
: Run full ECP (enumerator + conjurer)False
: Skip enumeration (Chain-of-Thought baseline)
--problem_name
:"all"
(default) to process all entries- Or a comma-separated list of specific problem names
python src/ecp/main.py --mode autoformalize --problem_path constructivebench
python src/ecp/main.py --mode answer_gen --problem_path constructivebench --enable_enumerator true
Output Location:
output/data/dataset/constructivebench.json/deepseek-chat-code/
(To run the CoT baseline, set--enable_enumerator false
.)
After generating formalizations and conjectures (via answer_gen
), run:
python src/ecp/main.py --mode proof_gen --problem_path constructivebench
Note: Proof generation uses Goedel-Prover by default. You can override with models like
deepseek-ai/DeepSeek-Prover-V2-7B
orAI-MO/Kimina-Prover-Preview-Distill-7B
.
--enumerator_model
:deepseek-chat
--conjecturer_model
:deepseek-chat
--prover_model
:Goedel-LM/Goedel-Prover-SFT
--max_tokens
:500
--timeout
:60
(seconds)--pass_at_n
:32
(Pass@n metric for proof generation)--gpu
:1
(number of GPUs for proof generation)--use_embedding_search
:False
(Set toTrue
only if you have GPU resources for embedding-based Lean retrieval.)
You can override any of these options. Check src/ecp/main.py
for the full list of arguments.
After experiments, summarize answer/proving accuracies for a dataset (constructivebench
, putnam
, or test
):
python summarize_result.py --base <dataset>
If you find this work useful, please cite:
@misc{sun2025enumerateconjectureproveformallysolvinganswerconstruction,
title = {Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions},
author = {Jialiang Sun and Yuzhi Tang and Ao Li and Chris J. Maddison and Kuldeep S. Meel},
year = {2025},
eprint = {2505.18492},
archivePrefix= {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2505.18492},
}
The prover part of this repository use code from Goedel-Prover. Some of raw problems from the following datasets were included in ConstructiveBench, as mentioned in the paper: