Step-by-step reasoning verifiers that think

a year ago
  • #Chain-of-Thought
  • #Machine Learning
  • #Process Reward Models
  • Introduces ThinkPRM, a verbalized step-wise reward model for verification.
  • ThinkPRM uses chain-of-thought (CoT) verification and requires minimal supervision.
  • Outperforms baselines like LLM-as-a-Judge and discriminative verifiers on benchmarks.
  • Achieves better results while training on only 1% of the process labels in PRM800K.
  • Excels in out-of-domain evaluations on GPQA-Diamond and LiveCodeBench.
  • Scales verification compute more effectively than LLM-as-a-Judge under the same token budget.
  • Highlights the value of generative, long CoT PRMs for test-time verification.
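
To make the idea of step-wise test-time verification concrete, here is a minimal sketch of how per-step verifier scores might be used to pick one solution out of several candidates. Everything in it is an assumption made for illustration: the `StepVerifier` type, the `toy_verifier` stand-in, the min-over-steps aggregation, and the best-of-N selection are not taken from the summary above and are not ThinkPRM's actual interface.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a verifier maps (problem, solution steps) to one
# correctness probability per step. A real generative PRM would produce
# these by reasoning about each step; that part is not modelled here.
StepVerifier = Callable[[str, Sequence[str]], List[float]]


def solution_score(step_probs: Sequence[float]) -> float:
    """Aggregate per-step scores into one solution score.

    Taking the minimum (a solution is only as good as its weakest step)
    is one common choice and is an assumption of this sketch.
    """
    return min(step_probs) if step_probs else 0.0


def best_of_n(problem: str,
              candidates: Sequence[Sequence[str]],
              verifier: StepVerifier) -> int:
    """Return the index of the candidate with the highest solution score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    scores = [solution_score(verifier(problem, steps)) for steps in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_verifier(problem: str, steps: Sequence[str]) -> List[float]:
    """Stand-in for a real model: treats any step containing '??' as doubtful."""
    return [0.1 if "??" in step else 0.9 for step in steps]


candidates = [
    ["x + 2 = 5", "x = 4 ??"],  # second step is flagged by the toy verifier
    ["x + 2 = 5", "x = 3"],
]
chosen = best_of_n("Solve x + 2 = 5", candidates, toy_verifier)
```

Here `chosen` is the index of the second candidate, because its weakest step scores higher than the flagged step in the first. Swapping `toy_verifier` for a model-backed function with the same signature would leave the selection logic unchanged.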