Outcome-Based Reinforcement Learning to Predict the Future
- #Forecasting
- #Machine Learning
- #Reinforcement Learning
- Reinforcement learning with verifiable rewards (RLVR) enhances large language models in math and coding but faces challenges in real-world domains like forecasting.
- Outcome-based reinforcement learning for forecasting must handle binary, delayed, and noisy rewards, conditions that make standard fine-tuning brittle.
- Adaptations of two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, improve forecasting accuracy, calibration, and performance in hypothetical prediction-market betting.
- Key adaptations include removing per-question variance scaling in GRPO and applying baseline-subtracted advantages in ReMax (both sketched in the first code block after this list), plus training on 100k synthetic questions.
- Lightweight guardrails penalize gibberish, non-English responses, and missing rationales (see the guardrail sketch below), enabling a stable training run over 110k events.
- A 14B model with these adaptations matches the frontier baseline's accuracy and surpasses its calibration, an edge that translates into economic value in the trading test below.
- Scaling ReMax to 110k questions and ensembling predictions yields a model with a Brier score of 0.193 and an ECE of 0.042 (both metrics are defined in a sketch below).
- Hypothetical trading yields a profit of \$127 versus \$92 for the baseline (a toy version of such a trading rule is sketched last), indicating that refined RLVR methods can produce economically valuable forecasting tools.
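
To make the two advantage adaptations concrete, here is a minimal sketch assuming scalar per-rollout rewards; the function names and the `1e-6` stabiliser are illustrative, not the paper's implementation.

```python
import numpy as np

def grpo_advantages(rewards, scale_by_std=False):
    """Group-relative advantages for one question's sampled rollouts.

    Standard GRPO divides by the group's reward std; the adaptation drops
    that scaling, which otherwise blows up advantages when binary, noisy
    rewards make a group's rewards nearly identical.
    """
    rewards = np.asarray(rewards, dtype=float)
    advantages = rewards - rewards.mean()   # baseline = group mean reward
    if scale_by_std:                        # standard GRPO behaviour
        advantages /= rewards.std() + 1e-6
    return advantages

def remax_advantage(sampled_reward, greedy_reward):
    """ReMax-style advantage: the reward of a sampled rollout minus the
    reward of the greedy rollout for the same question, used as a baseline."""
    return sampled_reward - greedy_reward
```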
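The guardrails can be read as penalty terms folded into the outcome reward. The sketch below assumes the base reward is the negative Brier score of the extracted probability; the penalty sizes, the English-ness heuristic, and the `rationale` marker are all assumptions for illustration.

```python
def is_mostly_english(text, threshold=0.8):
    """Crude proxy: the share of alphabetic characters that are ASCII.
    A real guardrail would use a language-ID model; this is illustrative."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and sum(c.isascii() for c in letters) / len(letters) >= threshold

def guarded_reward(prob, outcome, response_text):
    """Outcome reward plus lightweight guardrail penalties.

    prob: extracted probability in [0, 1], or None if unparseable (gibberish);
    outcome: 0/1 event resolution. Penalty magnitudes are illustrative.
    """
    if prob is None:                          # gibberish / no parseable forecast
        return -1.0
    penalty = 0.0
    if not is_mostly_english(response_text):  # non-English response
        penalty += 0.5
    if "rationale" not in response_text.lower():  # missing rationale (assumed marker)
        penalty += 0.5
    return -((prob - outcome) ** 2) - penalty  # negative Brier score as base reward
```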
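For reference, the two reported metrics, with simple mean-ensembling of several sampled forecasts per question; the ensemble size and bin count here are illustrative choices, not the paper's.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes
    (lower is better; 0.193 is the score reported above)."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: per-bin gap between mean forecast and empirical frequency,
    weighted by the share of forecasts falling in each bin."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & ((probs < hi) | (hi == 1.0))
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(ece)

# Ensembling: average several sampled forecasts per question.
samples = np.array([[0.7, 0.2, 0.9],
                    [0.6, 0.3, 0.8]])   # shape (n_samples, n_questions)
ensembled = samples.mean(axis=0)
```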
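Finally, a toy version of turning calibrated forecasts into trading profit, assuming binary contracts that pay \$1 on a win; the edge threshold and flat stake are assumptions, not the paper's trading rule.

```python
def hypothetical_profit(model_probs, market_prices, outcomes, edge=0.05, stake=1.0):
    """Buy YES when the model's probability exceeds the market price by `edge`,
    buy NO in the mirror case. A binary contract pays 1 on a win, so profit
    per trade is the payout minus the price paid."""
    profit = 0.0
    for p, price, y in zip(model_probs, market_prices, outcomes):
        if p - price > edge:      # model says YES is underpriced
            profit += stake * ((1.0 if y == 1 else 0.0) - price)
        elif price - p > edge:    # model says NO is underpriced
            profit += stake * ((1.0 if y == 0 else 0.0) - (1.0 - price))
    return profit

# e.g. a calibrated model buying two mispriced contracts: 0.4 + 0.3 = 0.7
print(hypothetical_profit([0.8, 0.1], [0.6, 0.3], [1, 0]))
```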