Step-by-step reasoning verifiers that think

a year ago

Introduces ThinkPRM, a verbalized step-wise reward model for verification.
ThinkPRM uses chain-of-thought (CoT) verification and requires minimal supervision.
Outperforms LLM-as-a-Judge and discriminative verifiers on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search.
Achieves these results using only 1% of the process labels in PRM800K.
Excels in out-of-domain evaluations on GPQA-Diamond and LiveCodeBench.
Scales verification compute more effectively than LLM-as-a-Judge under the same token budget.
Highlights the value of generative, long CoT PRMs for test-time verification.
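
To make the idea of step-wise verification for test-time selection concrete, here is a minimal sketch of best-of-N selection driven by a verifier that writes out a verdict for each step. Everything in it is an assumption for illustration: the "Step k: correct/incorrect" verdict format, the prefix-based scoring rule, and the names step_verdicts, solution_score, and best_of_n are hypothetical and are not taken from ThinkPRM, which may score solutions differently.

```python
import re
from typing import Callable, List

# Hypothetical verdict format: one "Step k: correct" or "Step k: incorrect"
# line per solution step. This is an assumed format, not ThinkPRM's own.
VERDICT = re.compile(r"step\s+(\d+)\s*:\s*(correct|incorrect)", re.IGNORECASE)


def step_verdicts(chain: str) -> List[bool]:
    """Extract per-step verdicts from a verification chain, ordered by step number."""
    found = {int(n): label.lower() == "correct" for n, label in VERDICT.findall(chain)}
    return [found[k] for k in sorted(found)]


def solution_score(chain: str) -> float:
    """Score a solution as the fraction of steps judged correct before the first error.

    Returns 0.0 when no verdicts can be parsed. This scoring rule is an
    illustrative assumption, not the rule used in the paper.
    """
    verdicts = step_verdicts(chain)
    if not verdicts:
        return 0.0
    good_prefix = 0
    for ok in verdicts:
        if not ok:
            break
        good_prefix += 1
    return good_prefix / len(verdicts)


def best_of_n(candidates: List[str], verify: Callable[[str], str]) -> str:
    """Return the candidate whose verification chain gets the highest score.

    `verify` stands in for a call to a generative verifier; it maps a
    candidate solution to the verifier's written-out verification chain.
    """
    return max(candidates, key=lambda c: solution_score(verify(c)))


if __name__ == "__main__":
    # Stub verifier output for two candidate solutions, for demonstration only.
    chains = {
        "solution A": "Step 1: correct\nStep 2: incorrect",
        "solution B": "Step 1: correct\nStep 2: correct",
    }
    print(best_of_n(list(chains), lambda c: chains[c]))
```

In practice `verify` would wrap a model call, and spending more verification compute could mean sampling several chains per candidate and averaging their scores.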
