Step-by-step reasoning verifiers that think
a year ago
- #Chain-of-Thought
- #Machine Learning
- #Process Reward Models
- Introduces ThinkPRM, a verbalized step-wise process reward model (PRM) for verification.
- ThinkPRM uses chain-of-thought (CoT) verification and requires minimal supervision.
- Outperforms baselines like LLM-as-a-Judge and discriminative verifiers on benchmarks.
- Achieves these results while training on only 1% of the process labels in PRM800K.
- Excels in out-of-domain evaluations on GPQA-Diamond and LiveCodeBench.
- Under the same token budget, scales verification compute more effectively than LLM-as-a-Judge.
- Highlights the value of generative, long CoT PRMs for test-time verification.
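
To make the idea of a verbalized step-wise verifier concrete, here is a minimal sketch of how its output might be consumed at test time. It is not ThinkPRM's actual implementation. The verdict format (`Step k: correct` / `Step k: incorrect`), the scoring rule, and the names `parse_step_verdicts`, `solution_score`, and `select_best` are all assumptions made for illustration. The `verify` argument stands in for any model call that returns a verification chain of thought as text.

```python
import re
from typing import Callable, List

# Assumed (hypothetical) verdict format: the verifier's chain of thought
# closes each step critique with "Step k: correct" or "Step k: incorrect".
VERDICT = re.compile(r"Step\s+(\d+):\s*(correct|incorrect)", re.IGNORECASE)


def parse_step_verdicts(verification_text: str) -> List[bool]:
    """Extract per-step verdicts, in order of appearance, from verifier text."""
    return [
        match.group(2).lower() == "correct"
        for match in VERDICT.finditer(verification_text)
    ]


def solution_score(verdicts: List[bool]) -> float:
    """Illustrative scoring rule: fraction of steps before the first error.

    Returns 1.0 when every step is judged correct and 0.0 when no verdicts
    could be parsed. This rule is an assumption, not the paper's.
    """
    if not verdicts:
        return 0.0
    for index, step_ok in enumerate(verdicts):
        if not step_ok:
            return index / len(verdicts)
    return 1.0


def select_best(candidates: List[str], verify: Callable[[str], str]) -> str:
    """Return the candidate solution whose verification scores highest.

    `verify` is a placeholder for a call to a generative verifier; ties are
    resolved in favor of the earliest candidate.
    """
    return max(
        candidates,
        key=lambda solution: solution_score(parse_step_verdicts(verify(solution))),
    )
```

In this sketch, spending more verification compute would amount to making `verify` produce longer or repeated critiques per candidate, while the selection logic stays unchanged.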