Outcome-Based Reinforcement Learning to Predict the Future
- #Forecasting
- #Machine Learning
- #Reinforcement Learning
- Reinforcement learning with verifiable rewards (RLVR) enhances large language models in math and coding but faces challenges in real-world domains like forecasting.
- Outcome-based reinforcement learning for forecasting must handle binary, delayed, and noisy rewards, conditions that make standard fine-tuning brittle.
- Adaptations of two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, improve forecasting accuracy, calibration, and performance in hypothetical prediction-market betting.
- Key adaptations include removing per-question variance scaling in GRPO and applying baseline-subtracted advantages in ReMax (both sketched in the first code block after this list), plus training on 100k synthetic questions.
- Lightweight guardrails penalize gibberish, non-English responses, and missing rationales (see the guardrail sketch below), enabling a stable training run over 110k events.
- A 14B model with these adaptations matches the frontier baseline's accuracy and surpasses its calibration, an edge that translates into economic value in the trading test below.
- Scaling ReMax to 110k questions and ensembling predictions yields a model with a Brier score of 0.193 and an ECE of 0.042 (both metrics are defined in a sketch below).
- Hypothetical trading yields a profit of \$127 versus \$92 for the baseline (a toy version of such a trading rule is sketched last), indicating that refined RLVR methods can produce economically valuable forecasting tools.
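
To make the two advantage adaptations concrete, here is a minimal sketch assuming scalar per-rollout rewards; the function names and the `1e-6` stabiliser are illustrative, not the paper's implementation.

```python
import numpy as np

def grpo_advantages(rewards, scale_by_std=False):
    """Group-relative advantages for one question's sampled rollouts.

    Standard GRPO divides by the group's reward std; the adaptation drops
    that scaling, which otherwise blows up advantages when binary, noisy
    rewards make a group's rewards nearly identical.
    """
    rewards = np.asarray(rewards, dtype=float)
    advantages = rewards - rewards.mean()   # baseline = group mean reward
    if scale_by_std:                        # standard GRPO behaviour
        advantages /= rewards.std() + 1e-6
    return advantages

def remax_advantage(sampled_reward, greedy_reward):
    """ReMax-style advantage: the reward of a sampled rollout minus the
    reward of the greedy rollout for the same question, used as a baseline."""
    return sampled_reward - greedy_reward
```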
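The guardrails can be read as penalty terms folded into the outcome reward. The sketch below assumes the base reward is the negative Brier score of the extracted probability; the penalty sizes, the English-ness heuristic, and the `rationale` marker are all assumptions for illustration.

```python
def is_mostly_english(text, threshold=0.8):
    """Crude proxy: the share of alphabetic characters that are ASCII.
    A real guardrail would use a language-ID model; this is illustrative."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and sum(c.isascii() for c in letters) / len(letters) >= threshold

def guarded_reward(prob, outcome, response_text):
    """Outcome reward plus lightweight guardrail penalties.

    prob: extracted probability in [0, 1], or None if unparseable (gibberish);
    outcome: 0/1 event resolution. Penalty magnitudes are illustrative.
    """
    if prob is None:                          # gibberish / no parseable forecast
        return -1.0
    penalty = 0.0
    if not is_mostly_english(response_text):  # non-English response
        penalty += 0.5
    if "rationale" not in response_text.lower():  # missing rationale (assumed marker)
        penalty += 0.5
    return -((prob - outcome) ** 2) - penalty  # negative Brier score as base reward
```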
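For reference, the two reported metrics, with simple mean-ensembling of several sampled forecasts per question; the ensemble size and bin count here are illustrative choices, not the paper's.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes
    (lower is better; 0.193 is the score reported above)."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: per-bin gap between mean forecast and empirical frequency,
    weighted by the share of forecasts falling in each bin."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & ((probs < hi) | (hi == 1.0))
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(ece)

# Ensembling: average several sampled forecasts per question.
samples = np.array([[0.7, 0.2, 0.9],
                    [0.6, 0.3, 0.8]])   # shape (n_samples, n_questions)
ensembled = samples.mean(axis=0)
```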
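Finally, a toy version of turning calibrated forecasts into trading profit, assuming binary contracts that pay \$1 on a win; the edge threshold and flat stake are assumptions, not the paper's trading rule.

```python
def hypothetical_profit(model_probs, market_prices, outcomes, edge=0.05, stake=1.0):
    """Buy YES when the model's probability exceeds the market price by `edge`,
    buy NO in the mirror case. A binary contract pays 1 on a win, so profit
    per trade is the payout minus the price paid."""
    profit = 0.0
    for p, price, y in zip(model_probs, market_prices, outcomes):
        if p - price > edge:      # model says YES is underpriced
            profit += stake * ((1.0 if y == 1 else 0.0) - price)
        elif price - p > edge:    # model says NO is underpriced
            profit += stake * ((1.0 if y == 0 else 0.0) - (1.0 - price))
    return profit

# e.g. a calibrated model buying two mispriced contracts: 0.4 + 0.3 = 0.7
print(hypothetical_profit([0.8, 0.1], [0.6, 0.3], [1, 0]))
```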