LawZero: Safety from Honesty in a Disinterested AI Predictor
6 hours ago
- #Bayesian Predictor
- #Implicit Agency
- #AI Safety
- AI systems optimized for downstream outcomes can develop implicit agency: goal-directed behavior that their designers did not intend.
- The Scientist AI (SAI) Predictor is trained to approximate the Bayesian posterior using 'epistemically contextualized' natural-language statements, separating factual claims from communication acts.
- Training targets honest prediction without turning the model into an agent: expressions of goals are treated as evidence, not adopted as the model's own drives.
- The Predictor is trained with a posterior-seeking objective that favors calibrated, cautious predictions, and deployment outcomes are not used as a reward signal.
- Under specific assumptions, the probability of producing a dangerously deceptive Predictor is low, because coordinated deception is rare and costly.
- Safety and accuracy are aligned: the constraints that ensure accuracy also make deception expensive, which guards against misalignment arising inside the Predictor.
- The Predictor can still be used inside an agentic system, with the agency supplied externally by explicit scaffolding and guardrails.
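
The last point, agency supplied from outside the Predictor, can be illustrated with a minimal Python sketch. This is not LawZero's implementation: `Guardrail`, `choose_action`, the `harm_probability` callable, and the 0.01 threshold are all hypothetical names and values chosen for illustration. The assumption is that the Predictor only answers probability queries, while the goal (here, a `utility` function) and the veto rule live entirely in the surrounding scaffold.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Guardrail:
    """Hypothetical guardrail: vetoes actions the predictor rates as too risky."""

    # Stand-in for a query to a non-agentic predictor. It returns an estimated
    # probability of harm for an action and has no goals of its own.
    harm_probability: Callable[[str], float]
    # Illustrative risk threshold (an assumption, not a value from the source).
    threshold: float = 0.01

    def allows(self, action: str) -> bool:
        return self.harm_probability(action) < self.threshold


def choose_action(
    candidates: Sequence[str],
    guardrail: Guardrail,
    utility: Callable[[str], float],
) -> Optional[str]:
    """Pick the highest-utility candidate that passes the guardrail.

    The goal (utility) is supplied by the scaffold, not by the predictor.
    Returns None when every candidate is vetoed.
    """
    permitted = [a for a in candidates if guardrail.allows(a)]
    if not permitted:
        return None
    return max(permitted, key=utility)


if __name__ == "__main__":
    # Toy lookup tables standing in for predictor outputs and task utility.
    toy_harm = {"send_report": 0.001, "delete_logs": 0.40}
    toy_utility = {"send_report": 1.0, "delete_logs": 5.0}

    guardrail = Guardrail(harm_probability=lambda a: toy_harm.get(a, 1.0))
    print(choose_action(list(toy_harm), guardrail, utility=toy_utility.__getitem__))
```

In this toy setup the higher-utility action is vetoed because its estimated harm probability exceeds the threshold, so the scaffold falls back to the permitted one. Unknown actions default to a harm probability of 1.0, a deliberately cautious choice for the sketch.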