[COLM 2025] When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
🤗 Huggingface • 📄 Paper
Authors: Nishad Singhi*, Hritik Bansal*, Arian Hosseini*, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
(* Equal Contribution)
This repository contains the code needed to:
- Generate solutions to problems from various datasets
- Generate verifications for solutions
- Compute Success Rates using strategies like Self-Consistency and Best-of-N
These datasets are supported:
- MATH
- AIME24
- AIME25
- GPQA-Diamond
We use vLLM for inference, so any model it supports will work with our generation scripts.
This repository is built on the Large Language Monkeys repository.
Create the conda environment
conda env create -f environment.yml
conda activate <environment_name>
The repo is organized as follows:
sc-genrm-scaling/
├── llmonk/
│   ├── evaluate/
│   │   ├── eval_utils.py
│   │   ├── math_datasets.py
│   │   ├── math_pass_k.py
│   │   └── plot_success_rate.py
│   ├── generate/
│   │   └── generate_solutions.py
│   ├── verify/
│   │   └── generate_verifications.py
│   └── utils.py
├── README.md
└── environment.yml
- `llmonk/evaluate/`: contains the code to plot success rates
- `llmonk/generate/`: contains the code to generate samples from a model
- `llmonk/verify/`: contains the code to generate verifications from a model
python llmonk/generate/generate_solutions.py \
model=meta-llama/Llama-3.1-8B-Instruct \
save_dir=llmonk/outputs/<output_folder_name> \
--list vllm_args --disable-log-requests list-- --list stop_strings Problem: list-- \
temperature=0.7 \
num_samples=256 \
batch_size=64 \
num_workers=32 \
zero_shot=False \
dataset=math128 \
max_tokens=1024 \
apply_chat_template=True \
save_every_n_samples=64
- `model`: can be any LLM supported by vLLM
- `save_dir`: path to the folder where the samples should be stored
- `temperature`: sampling temperature
- `num_samples`: the number of solutions to sample per problem
- `batch_size`: batch size to sample from vLLM
- `num_workers`: number of workers for multiprocessing
- `zero_shot`: if this is set to `True`, a zero-shot prompt is used to prompt the LLM for sampling solutions (e.g., for QwQ-32B). If `False`, the 4-shot prompt is used.
- `dataset`: the dataset can be one of `['math_train', 'math128', 'aime24', 'aime25', 'amc23', 'gpqa_diamond_64']`
- `max_tokens`: the maximum number of tokens that can be sampled for a solution
- `apply_chat_template`: if this is set to `True`, the chat template is applied to the prompt before passing it to the LLM. We set this to `True` everywhere except where a 4-shot prompt is used.
- `save_every_n_samples`: saves the samples to the YAML file after every `n` samples, instead of waiting for all samples to be generated before saving
You can find some more information here.
The samples are saved as YAML files (one YAML file per problem). Each problem's YAML file contains the following keys (a loading sketch follows the list):
- `prompt`: the prompt for the problem
- `question`: the current question for the problem
- `samples`: a list of sampled solutions for the problem
- `gt_answer`: the dataset's ground-truth answer for the problem
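For quick inspection, here is a minimal sketch of how one of these per-problem YAML files might be loaded; the exact file name (`0.yaml`) is an assumption and depends on the dataset and `save_dir` you used.

```python
import yaml

# NOTE: the file name "0.yaml" is an assumption; use whichever per-problem
# file generate_solutions.py wrote into your save_dir.
with open("llmonk/outputs/<output_folder_name>/0.yaml") as f:
    result = yaml.safe_load(f)

print(result["question"])      # the problem statement
print(result["gt_answer"])     # ground-truth answer from the dataset
print(len(result["samples"]))  # number of sampled solutions (e.g., 256)
print(result["samples"][0])    # one generated solution
```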
First, we use a solution generator (like Llama-3.1-8B-Instruct) to generate solutions to problems in the MATH training split. Next, we use GPT-4o to generate verifications for these solutions using this prompt.
- GenRM fine-tuning data:
Then, we use LLaMA-Factory to fine-tune LLMs with LoRA using this prompt, obtaining GenRM-FT.
- Trained GenRM-FT verifiers:
Our models and data, including verifications generated from various GenRMs, are available here.
A demo of verifications generated by a fine-tuned verifier is available here.
python3 llmonk/verify/generate_verifications.py \
model=meta-llama/Llama-3.1-8B-Instruct \
verification_template=llmonk/verify/vprompts/llama3.1_8b_instruct/genrm_base.txt \
output_dir=llmonk/outputs/verifications/<verification_folder_name> \
samples_dir=llmonk/outputs/<folder_where_solutions_are_saved> \
num_verifications=32 \
--list vllm_args --disable-log-requests list-- \
batch_size=32 \
temperature=0.7 \
logprobs=1 \
max_tokens=1024 \
num_workers=32 \
num_problems=128 \
num_solutions=256
- `verification_template`: path of the template for the verification prompt
- `output_dir`: path to the folder where the generated verifications are stored
- `samples_dir`: path to the folder where the solutions (generated in the previous step) are saved (and will be loaded from)
- `num_verifications`: number of verifications per solution
- `logprobs`: how many logprobs to generate (passed to vLLM)
- `num_problems`: number of problems to generate verifications for
- `num_solutions`: number of solutions per problem to verify
You can find more information here. A demo of verifications generated by a fine-tuned verifier is available here.
The outputs are structured this way:
outputs/verifications
└── folder_name/
    ├── problem_0/
    │   ├── solution_0.yaml
    │   ├── solution_1.yaml
    │   └── ...
    ├── problem_1/
    │   ├── solution_0.yaml
    │   ├── solution_1.yaml
    │   └── ...
    ├── ...
    └── problem_127/
        ├── solution_0.yaml
        ├── solution_1.yaml
        └── ...
Every problem has a separate folder, and every YAML file inside it contains all the verifications for one solution, with the following keys (a scoring sketch follows the list):
- `prompt`: the prompt used to generate verifications
- `verifications`: list of all verifications. Every element in the list contains three items:
  - `p(yes)`: probability of the token `Yes` in the sampled output
  - `p(no)`: probability of the token `No` in the sampled output
  - `verification`: the verification text
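As a rough illustration of how these fields can be turned into a single score per solution (e.g., for Best-of-N selection), the sketch below averages `p(yes)` over all verifications of one solution. The file path is an assumption, and this is only one simple aggregation, not necessarily the exact scoring used by the evaluation code.

```python
import yaml

# Path is an assumption; point it at one solution's verification file.
path = "llmonk/outputs/verifications/<verification_folder_name>/problem_0/solution_0.yaml"
with open(path) as f:
    data = yaml.safe_load(f)

# Average p(yes) over all verifications of this solution to get one score.
p_yes = [v["p(yes)"] for v in data["verifications"]]
solution_score = sum(p_yes) / len(p_yes)
print(f"verifier score: {solution_score:.3f}")
```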
cd llmonk/evaluate
python plot_success_rate.py --config config/<config_name>.yaml
The config file specifies parameters for generating success-rate plots; see the Config README for more details.
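Conceptually, Self-Consistency picks the most frequent final answer among the sampled solutions, while Best-of-N uses verifier scores to pick one. The sketch below illustrates these selection rules on already-extracted answers and scores; it is a simplified illustration under those assumptions, not the repository's implementation.

```python
from collections import Counter, defaultdict

def self_consistency(answers):
    """Majority vote over the final answers extracted from N solutions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, scores):
    """Return the answer of the single highest-scoring solution."""
    best_idx = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_idx]

def weighted_best_of_n(answers, scores):
    """Sum verifier scores per distinct answer and return the best one."""
    totals = defaultdict(float)
    for ans, s in zip(answers, scores):
        totals[ans] += s
    return max(totals, key=totals.get)

# Toy example: 4 sampled solutions with extracted answers and verifier scores.
answers = ["42", "41", "42", "7"]
scores = [0.9, 0.2, 0.6, 0.8]
print(self_consistency(answers))            # "42" (appears twice)
print(best_of_n(answers, scores))           # "42" (highest score 0.9)
print(weighted_best_of_n(answers, scores))  # "42" (0.9 + 0.6 = 1.5)
```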
If you use this code in your research, please cite our paper. You can use the following BibTeX entry:
@misc{singhi2025whentosolve,
title={When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning},
author={Nishad Singhi and Hritik Bansal and Arian Hosseini and Aditya Grover and Kai-Wei Chang and Marcus Rohrbach and Anna Rohrbach},
year={2025},
eprint={2504.01005},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.01005},
}