Hasty Briefs (beta)


Outcome-Based Reinforcement Learning to Predict the Future

a year ago
  • #Forecasting
  • #Machine Learning
  • #Reinforcement Learning
  • Reinforcement learning with verifiable rewards (RLVR) enhances large language models in math and coding but faces challenges in real-world domains like forecasting.
  • Outcome-based reinforcement learning for forecasting must handle binary, delayed, and noisy rewards, making standard fine-tuning brittle.
  • Adaptations of the Group-Relative Policy Optimisation (GRPO) and ReMax algorithms improve forecasting accuracy, calibration, and performance in hypothetical prediction-market betting.
  • Key adaptations include removing per-question variance scaling in GRPO, applying baseline-subtracted advantages in ReMax, and using 100k synthetic questions for training.
  • Lightweight guardrails penalize gibberish, non-English responses, and missing rationales, enabling stable training over 110k events.
  • A 14B model with these adaptations matches frontier baseline accuracy and surpasses it in calibration, demonstrating economic value in forecasting.
  • Scaling ReMax to 110k questions and ensembling predictions yields a model with a Brier score of 0.193 and an expected calibration error (ECE) of 0.042.
  • Hypothetical trading shows a profit of \$127 versus \$92 for the baseline, indicating refined RLVR methods can create economically valuable forecasting tools.
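The two advantage adaptations listed above can be sketched concretely. This is a minimal illustration, not the paper's implementation: function names are invented, rewards are assumed to be scalar per-rollout values, and "removing per-question variance scaling" is taken to mean dropping GRPO's usual division by the group standard deviation.

```python
import numpy as np

def grpo_advantages_no_std(rewards):
    """Group-relative advantages WITHOUT per-question std scaling.

    Standard GRPO normalizes (r - mean) / std within each question's
    group of rollouts; with binary, noisy forecasting rewards the std
    scaling is dropped, leaving a plain mean-centred advantage.
    """
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def remax_advantages(sample_rewards, greedy_baseline_reward):
    """ReMax-style advantage: subtract the reward of a greedy
    (deterministic) baseline rollout from each sampled rollout's reward."""
    return np.asarray(sample_rewards, dtype=float) - greedy_baseline_reward

# Four rollouts for one question, binary resolution rewards:
adv = grpo_advantages_no_std([1.0, 0.0, 1.0, 0.0])   # [0.5, -0.5, 0.5, -0.5]
rmx = remax_advantages([1.0, 0.0], greedy_baseline_reward=1.0)  # [0.0, -1.0]
```

Mean-centring keeps the gradient signal zero-sum within a group, while skipping the std division avoids blowing up advantages on questions where nearly all rollouts agree.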
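The reported metrics (Brier score 0.193, ECE 0.042) have standard definitions, sketched below for binary forecasts. The equal-width binning scheme and bin count are common defaults, not details taken from the paper.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    0.0 is perfect; 0.25 is the score of always predicting 0.5.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE with equal-width probability bins: within each bin, compare
    the mean forecast to the empirical outcome frequency, then average
    the gaps weighted by bin size."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(ece)
```

A low Brier score rewards sharp, accurate forecasts, while a low ECE specifically certifies that stated probabilities match observed frequencies, which is why the summary treats calibration as a separate win over the frontier baseline.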
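The lightweight guardrails mentioned above (penalizing gibberish, non-English output, and missing rationales) amount to reward shaping on the response text before the outcome reward is applied. A hypothetical sketch follows; the checks, tag name, and penalty magnitudes are all assumptions for illustration, not the paper's exact rules.

```python
def guardrail_penalty(response: str, rationale_tag: str = "Rationale:") -> float:
    """Return a non-positive penalty added to the outcome reward.

    Hypothetical checks: (1) the response must contain a rationale
    section, (2) it should be mostly ASCII as a cheap proxy for
    English, non-gibberish text. Penalty sizes are illustrative.
    """
    penalty = 0.0
    # Missing rationale: small fixed penalty.
    if rationale_tag not in response:
        penalty -= 0.1
    # Mostly non-ASCII text: proxy for non-English or gibberish output.
    ascii_frac = sum(c.isascii() for c in response) / max(len(response), 1)
    if ascii_frac < 0.8:
        penalty -= 0.1
    return penalty
```

Keeping the penalties small relative to the binary outcome reward lets them stabilize training format without drowning out the forecasting signal over the 110k events.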