[COLM 2025] When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
🤗 Huggingface • 📄 Paper
Authors: Nishad Singhi*, Hritik Bansal*, Arian Hosseini*, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
(* Equal Contribution)
This repository contains the code needed to:
- Generate solutions to problems from various datasets
- Generate verifications for solutions
- Compute Success Rates using strategies like Self-Consistency and Best-of-N
These datasets are supported:
- MATH
- AIME24
- AIME25
- GPQA-Diamond
We use vLLM for inference, so any model it supports will work with our generation scripts.
This repository is built on the Large Language Monkeys repository.
Create the conda environment
conda env create -f environment.yml
conda activate <environment_name>
The repo is organized as follows:
sc-genrm-scaling/
├── llmonk/
│   ├── evaluate/
│   │   ├── eval_utils.py
│   │   ├── math_datasets.py
│   │   ├── math_pass_k.py
│   │   └── plot_success_rate.py
│   ├── generate/
│   │   └── generate_solutions.py
│   ├── verify/
│   │   └── generate_verifications.py
│   └── utils.py
├── README.md
└── environment.yml
- `llmonk/evaluate/`: contains the code to plot success rates
- `llmonk/generate/`: contains the code to generate samples from a model
- `llmonk/verify/`: contains the code to generate verifications from a model
python llmonk/generate/generate_solutions.py \
model=meta-llama/Llama-3.1-8B-Instruct \
save_dir=llmonk/outputs/<output_folder_name> \
--list vllm_args --disable-log-requests list-- --list stop_strings Problem: list-- \
temperature=0.7 \
num_samples=256 \
batch_size=64 \
num_workers=32 \
zero_shot=False \
dataset=math128 \
max_tokens=1024 \
apply_chat_template=True \
save_every_n_samples=64
- `model`: can be any LLM supported by vLLM
- `save_dir`: path to the folder where the samples should be stored
- `temperature`: sampling temperature
- `num_samples`: the number of solutions to sample per problem
- `batch_size`: batch size to sample from vLLM
- `num_workers`: number of workers for multiprocessing
- `zero_shot`: if this is set to `True`, a zero-shot prompt is used to prompt the LLM for sampling solutions (e.g., for QwQ-32B). If `False`, the 4-shot prompt is used.
- `dataset`: the dataset can be one of `['math_train', 'math128', 'aime24', 'aime25', 'amc23', 'gpqa_diamond_64']`
- `max_tokens`: the maximum number of tokens that can be sampled for a solution
- `apply_chat_template`: if this is set to `True`, the chat template is applied to the prompt before passing it to the LLM. We set this to `True` everywhere except where a 4-shot prompt is used.
- `save_every_n_samples`: saves the samples to the YAML file after every `n` samples, instead of waiting for all samples to be generated before saving
You can find some more information here.
The samples are saved as YAML files (one YAML file per problem). Each problem's YAML file contains the following keys (a loading sketch follows the list):
- `prompt`: the prompt for the problem
- `question`: the current question for the problem
- `samples`: a list of sampled solutions for the problem
- `gt_answer`: the dataset's ground-truth answer for the problem
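For quick inspection, here is a minimal sketch of how one of these per-problem YAML files might be loaded; the exact file name (`0.yaml`) is an assumption and depends on the dataset and `save_dir` you used.

```python
import yaml

# NOTE: the file name "0.yaml" is an assumption; use whichever per-problem
# file generate_solutions.py wrote into your save_dir.
with open("llmonk/outputs/<output_folder_name>/0.yaml") as f:
    result = yaml.safe_load(f)

print(result["question"])      # the problem statement
print(result["gt_answer"])     # ground-truth answer from the dataset
print(len(result["samples"]))  # number of sampled solutions (e.g., 256)
print(result["samples"][0])    # one generated solution
```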
First, we use a solution generator (like Llama-3.1-8B-Instruct) to generate solutions to problems in the MATH training split. Next, we use GPT-4o to generate verifications for these solutions using this prompt.
- GenRM fine-tuning data:
Then, we use LLaMA-Factory to fine-tune LLMs with LoRA using this prompt, obtaining GenRM-FT.
- Trained GenRM-FT verifiers:
Our models and data, including verifications generated from various GenRMs, are available here.
A demo of verifications generated by a fine-tuned verifier is available here.
python3 llmonk/verify/generate_verifications.py \
model=meta-llama/Llama-3.1-8B-Instruct \
verification_template=llmonk/verify/vprompts/llama3.1_8b_instruct/genrm_base.txt \
output_dir=llmonk/outputs/verifications/<verification_folder_name> \
samples_dir=llmonk/outputs/<folder_where_solutions_are_saved> \
num_verifications=32 \
--list vllm_args --disable-log-requests list-- \
batch_size=32 \
temperature=0.7 \
logprobs=1 \
max_tokens=1024 \
num_workers=32 \
num_problems=128 \
num_solutions=256
- `verification_template`: path of the template for the verification prompt
- `output_dir`: path to the folder where the generated verifications are stored
- `samples_dir`: path to the folder where the solutions (generated in the previous step) are saved (and will be loaded from)
- `num_verifications`: number of verifications per solution
- `logprobs`: how many logprobs to generate (passed to vLLM)
- `num_problems`: number of problems to generate verifications for
- `num_solutions`: number of solutions per problem to verify
You can find more information here. A demo of verifications generated by a fine-tuned verifier is available here.
The outputs are structured this way:
outputs/verifications
└── folder_name/
    ├── problem_0/
    │   ├── solution_0.yaml
    │   ├── solution_1.yaml
    │   └── ...
    ├── problem_1/
    │   ├── solution_0.yaml
    │   ├── solution_1.yaml
    │   └── ...
    ├── ...
    └── problem_127/
        ├── solution_0.yaml
        ├── solution_1.yaml
        └── ...
Every problem has a separate folder, and every YAML file inside it contains all the verifications for one solution, with the following keys (a scoring sketch follows the list):
- `prompt`: the prompt used to generate verifications
- `verifications`: list of all verifications. Every element in the list contains three items:
  - `p(yes)`: probability of the token `Yes` in the sampled output
  - `p(no)`: probability of the token `No` in the sampled output
  - `verification`: the verification text
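As a rough illustration of how these fields can be turned into a single score per solution (e.g., for Best-of-N selection), the sketch below averages `p(yes)` over all verifications of one solution. The file path is an assumption, and this is only one simple aggregation, not necessarily the exact scoring used by the evaluation code.

```python
import yaml

# Path is an assumption; point it at one solution's verification file.
path = "llmonk/outputs/verifications/<verification_folder_name>/problem_0/solution_0.yaml"
with open(path) as f:
    data = yaml.safe_load(f)

# Average p(yes) over all verifications of this solution to get one score.
p_yes = [v["p(yes)"] for v in data["verifications"]]
solution_score = sum(p_yes) / len(p_yes)
print(f"verifier score: {solution_score:.3f}")
```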
cd llmonk/evaluate
python plot_success_rate.py --config config/<config_name>.yaml
The config file specifies parameters for generating success-rate plots; see the Config README for more details.
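Conceptually, Self-Consistency picks the most frequent final answer among the sampled solutions, while Best-of-N uses verifier scores to pick one. The sketch below illustrates these selection rules on already-extracted answers and scores; it is a simplified illustration under those assumptions, not the repository's implementation.

```python
from collections import Counter, defaultdict

def self_consistency(answers):
    """Majority vote over the final answers extracted from N solutions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, scores):
    """Return the answer of the single highest-scoring solution."""
    best_idx = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_idx]

def weighted_best_of_n(answers, scores):
    """Sum verifier scores per distinct answer and return the best one."""
    totals = defaultdict(float)
    for ans, s in zip(answers, scores):
        totals[ans] += s
    return max(totals, key=totals.get)

# Toy example: 4 sampled solutions with extracted answers and verifier scores.
answers = ["42", "41", "42", "7"]
scores = [0.9, 0.2, 0.6, 0.8]
print(self_consistency(answers))            # "42" (appears twice)
print(best_of_n(answers, scores))           # "42" (highest score 0.9)
print(weighted_best_of_n(answers, scores))  # "42" (0.9 + 0.6 = 1.5)
```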
If you use this code in your research, please cite our paper. You can use the following BibTeX entry:
@misc{singhi2025whentosolve,
title={When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning},
author={Nishad Singhi and Hritik Bansal and Arian Hosseini and Aditya Grover and Kai-Wei Chang and Marcus Rohrbach and Anna Rohrbach},
year={2025},
eprint={2504.01005},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.01005},
}