[COLM 2025] When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

πŸ€— Hugging Face β€’ πŸ“„ Paper

Authors: Nishad Singhi*, Hritik Bansal*, Arian Hosseini*, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
(* Equal Contribution)

This repository contains the code needed to:

  1. Generate solutions to problems from various datasets
  2. Generate verifications for solutions
  3. Compute Success Rates using strategies like Self-Consistency and Best-of-N

The following datasets are supported:

  • MATH
  • AIME24
  • AIME25
  • AMC23
  • GPQA-Diamond

We use vLLM for inference, so any model it supports will work with our generation scripts.

Acknowledgements

This repository is built on the Large Language Monkeys repository.

Installation

Create the conda environment

conda env create -f environment.yml
conda activate <environment_name>

Repository Structure

The repo is organized as follows:

sc-genrm-scaling/
β”œβ”€β”€ llmonk/
β”‚   β”œβ”€β”€ evaluate/
β”‚   β”‚   β”œβ”€β”€ eval_utils.py
β”‚   β”‚   β”œβ”€β”€ math_datasets.py
β”‚   β”‚   β”œβ”€β”€ math_pass_k.py
β”‚   β”‚   └── plot_success_rate.py
β”‚   β”œβ”€β”€ generate/
β”‚   β”‚   └── generate_solutions.py
β”‚   β”œβ”€β”€ verify/
β”‚   β”‚   └── generate_verifications.py
β”‚   └── utils.py
β”œβ”€β”€ README.md
└── environment.yml
  • llmonk/evaluate/: contains the code to evaluate solutions and plot success rates
  • llmonk/generate/: contains the code to generate samples from a model
  • llmonk/verify/: contains code to generate verifications from a model

Generating Solutions

Usage

python llmonk/generate/generate_solutions.py \
model=meta-llama/Llama-3.1-8B-Instruct \
save_dir=llmonk/outputs/<output_folder_name> \
--list vllm_args --disable-log-requests list-- --list stop_strings Problem: list-- \
temperature=0.7 \
num_samples=256 \
batch_size=64 \
num_workers=32 \
zero_shot=False \
dataset=math128 \
max_tokens=1024 \
apply_chat_template=True \
save_every_n_samples=64
  • model: can be any LLM supported by vLLM
  • save_dir: path to the folder where the samples should be stored
  • temperature: sampling temperature
  • num_samples: the number of solutions to sample per problem
  • batch_size: batch size used when sampling from vLLM
  • num_workers: number of workers for multiprocessing
  • zero_shot: if this is set to True, a zero-shot prompt is used to prompt the LLM for sampling solutions (e.g., for QwQ-32B). If False, the 4-shot prompt is used.
  • dataset: the dataset can be one of ['math_train', 'math128', 'aime24', 'aime25', 'amc23', 'gpqa_diamond_64']
  • max_tokens: the maximum number of tokens that can be sampled for a solution
  • apply_chat_template: if this is set to True, the chat template is applied to the prompt before passing it to the LLM. We set this to True everywhere except where a 4-shot prompt is used.
  • save_every_n_samples: saves the samples in the YAML file after every n samples, instead of waiting for all samples to be generated before saving

You can find more information here.

Output Format

The samples are saved as YAML files (one YAML file per problem). Each problem's YAML file contains the following keys:

  • prompt: the prompt for the problem
  • question: the question text for the problem
  • samples: a list of sampled solutions for the problem
  • gt_answer: the dataset's ground truth answer for the problem
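
For quick inspection, these files can be loaded directly, e.g., to run a simple majority vote (Self-Consistency) over one problem's samples. Below is a minimal sketch in Python assuming the key layout above; the file path and the extract_answer helper are hypothetical stand-ins for the answer-parsing logic in llmonk/evaluate/eval_utils.py.

from collections import Counter
import yaml

def extract_answer(sample: str) -> str:
    # Hypothetical helper: pull the final answer out of a solution string.
    return sample.rsplit("The answer is", 1)[-1].strip().rstrip(".")

# Hypothetical path; samples are saved one YAML file per problem.
with open("llmonk/outputs/<output_folder_name>/problem_0.yaml") as f:
    record = yaml.safe_load(f)

answers = [extract_answer(s) for s in record["samples"]]
prediction, votes = Counter(answers).most_common(1)[0]
print(f"SC prediction: {prediction} ({votes}/{len(answers)} votes); gt: {record['gt_answer']}")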

GenRM Finetuning

First, we use a solution generator (like Llama-3.1-8B-Instruct) to generate solutions to problems in the MATH training split. Next, we use GPT-4o to generate verifications for these solutions using this prompt.

Then, we use LLaMA-Factory to fine-tune LLMs with LoRA using this prompt, obtaining GenRM-FT.

Our models and data, including verifications generated by various GenRMs, are available here.

A demo of verifications generated by a fine-tuned verifier is available here.

Generating Verifications

Usage

python3 llmonk/verify/generate_verifications.py \
model=meta-llama/Llama-3.1-8B-Instruct \
verification_template=llmonk/verify/vprompts/llama3.1_8b_instruct/genrm_base.txt \
output_dir=llmonk/outputs/verifications/<verification_folder_name> \
samples_dir=llmonk/outputs/<folder_where_solutions_are_saved> \
num_verifications=32 \
--list vllm_args --disable-log-requests list-- \
batch_size=32 \
temperature=0.7 \
logprobs=1 \
max_tokens=1024 \
num_workers=32 \
num_problems=128 \
num_solutions=256 
  • verification_template: path of the template for the verification prompt
  • output_dir: path to folder where generated verifications are stored
  • samples_dir: path to the folder where the solutions generated in the previous step are saved; they are loaded from here for verification
  • num_verifications: number of verifications per solution
  • logprobs: how many logprobs to generate (passed to vLLM)
  • num_problems: number of problems to generate verifications for

You can find more information here.

Output Format

The outputs are structured this way:

outputs/verifications
└── folder_name/
    β”œβ”€β”€ problem_0/
    β”‚   β”œβ”€β”€ solution_0.yaml
    β”‚   β”œβ”€β”€ solution_1.yaml
    β”‚   β”œβ”€β”€ ...
    β”œβ”€β”€ problem_1/
    β”‚   β”œβ”€β”€ solution_0.yaml
    β”‚   β”œβ”€β”€ solution_1.yaml
    β”‚   β”œβ”€β”€ ...
    β”œβ”€β”€ ...
    └── problem_127/
        β”œβ”€β”€ solution_0.yaml
        β”œβ”€β”€ solution_1.yaml
        β”œβ”€β”€ ...

Every problem has a separate folder, and each YAML file inside it contains all the verifications for one solution, with the following keys:

  • prompt: the prompt to generate verifications
  • verifications: list of all verifications. Every element in the list contains three items:
    • p(yes): probability of the token Yes in the sampled output
    • p(no): probability of the token No in the sampled output
    • verification: the verification text
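
These per-verification scores can be aggregated into a single verifier score per solution, e.g., by averaging p(yes) over its verifications, and the highest-scoring solution can then be selected (Best-of-N). A minimal sketch in Python assuming the directory layout and keys above (the path is a hypothetical placeholder):

from pathlib import Path
import yaml

problem_dir = Path("llmonk/outputs/verifications/<verification_folder_name>/problem_0")

scores = {}
for solution_file in sorted(problem_dir.glob("solution_*.yaml")):
    record = yaml.safe_load(solution_file.read_text())
    p_yes = [v["p(yes)"] for v in record["verifications"]]
    scores[solution_file.stem] = sum(p_yes) / len(p_yes)  # mean p(yes) as the verifier score

best = max(scores, key=scores.get)
print(f"Best-of-N pick: {best} (mean p(yes) = {scores[best]:.3f})")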

Plot Success Rates

Usage

cd llmonk/evaluate
python plot_success_rate.py --config config/<config_name>.yaml

The config file specifies parameters for generating success-rate plots. See the Config README for more details.
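
For intuition, the Self-Consistency success rate underlying these plots can also be computed directly from saved samples: for a budget of N solutions per problem, majority-vote over the first N samples and average correctness across problems. A minimal sketch with a hypothetical problems iterable (the config-driven script above is the supported entry point):

from collections import Counter

def sc_success_rate(problems, n: int) -> float:
    # problems: iterable of (answers, gt_answer) pairs, with answers already
    # extracted from each sampled solution.
    correct = total = 0
    for answers, gt in problems:
        vote, _ = Counter(answers[:n]).most_common(1)[0]
        correct += int(vote == gt)
        total += 1
    return correct / total

# Example: sweep sampling budgets to trace a success-rate curve.
# for n in (1, 2, 4, 8, 16, 32, 64, 128, 256):
#     print(n, sc_success_rate(problems, n))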

Citation

If you use this code in your research, please cite our paper. You can use the following BibTeX entry:

@misc{singhi2025whentosolve,
      title={When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning}, 
      author={Nishad Singhi and Hritik Bansal and Arian Hosseini and Aditya Grover and Kai-Wei Chang and Marcus Rohrbach and Anna Rohrbach},
      year={2025},
      eprint={2504.01005},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.01005}, 
}
