LawZero: Safety from Honesty in a Disinterested AI Predictor

6 hours ago

AI systems optimized for downstream outcomes can develop implicit agency: goal-directed behavior that their designers did not intend.
The Scientist AI (SAI) Predictor is trained to approximate the Bayesian posterior using 'epistemically contextualized' natural-language statements, separating factual claims from communication acts.
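
To make the distinction between a factual claim and a communication act concrete, here is a minimal sketch in Python. The class name, fields, and rendering templates are hypothetical and chosen only for illustration; the brief does not describe the actual statement format used by the SAI Predictor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextualizedStatement:
    """Hypothetical container: a proposition plus its epistemic context."""
    content: str            # the proposition itself
    source: Optional[str]   # who communicated it; None marks a direct factual claim

    def render(self) -> str:
        # A factual claim asserts the content; a communication act only
        # asserts that someone said it, leaving its truth open.
        if self.source is None:
            return f"It is true that {self.content}"
        return f"{self.source} stated that {self.content}"


claim = ContextualizedStatement("the bridge is safe", None)
report = ContextualizedStatement("the bridge is safe", "the contractor")
```

In this sketch the two objects share the same content but are distinct statements, so a predictor could in principle assign them different probabilities.
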
Training focuses on honest predictions without the model becoming an agent; expressions of goals are treated as evidence, not adopted as drives.
The Predictor uses a posterior-seeking objective to produce calibrated, cautious predictions, and it does not use deployment outcomes as a reward signal.
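
One way to see why a posterior-seeking objective rewards honesty is through a strictly proper scoring rule. The sketch below uses expected log loss as an assumed example (the brief does not say which loss the Predictor is trained with) and shows that, for a binary event, the expected loss is lowest when the reported probability equals the true one. The function names are illustrative.

```python
import math


def expected_log_loss(p_true: float, q_reported: float) -> float:
    """Expected log loss when the event has probability p_true
    and the predictor reports q_reported."""
    return -(p_true * math.log(q_reported)
             + (1 - p_true) * math.log(1 - q_reported))


def best_report(p_true: float, grid_size: int = 100) -> float:
    """Search a grid of reports in (0, 1) for the one with lowest expected loss."""
    candidates = [i / grid_size for i in range(1, grid_size)]
    return min(candidates, key=lambda q: expected_log_loss(p_true, q))
```

Under this assumed loss, both overconfident and underconfident reports score worse in expectation than the honest probability, so the objective itself pushes toward calibration rather than toward any outcome in the world.
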
Under specific assumptions, the probability of producing a dangerously deceptive Predictor is low, as coordinated deception is rare and costly.
Safety and accuracy are aligned: the constraints that ensure accuracy also make deception expensive, which guards against misalignment arising within the Predictor.
The Predictor can be used as part of an agentic system externally, with agency supplied by explicit scaffolding and guardrails.
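
The idea of supplying agency from outside the Predictor can be sketched as a small wrapper. Everything here is an assumption made for illustration: the function name, the harm-probability interface, and the threshold value are hypothetical and are not taken from the LawZero work. The point is only that the Predictor answers probability queries, while the choice among actions lives in explicit external code.

```python
from typing import Callable, Optional, Sequence


def choose_action(
    candidates: Sequence[str],
    predict_harm: Callable[[str], float],      # stand-in for a Predictor query
    score_usefulness: Callable[[str], float],  # supplied by the scaffolding, not the Predictor
    harm_threshold: float = 0.01,              # illustrative value only
) -> Optional[str]:
    """Return the most useful candidate whose predicted harm is below the
    threshold, or None if every candidate is rejected by the guardrail."""
    allowed = [a for a in candidates if predict_harm(a) < harm_threshold]
    if not allowed:
        return None
    return max(allowed, key=score_usefulness)
```

A usage example with stub functions: given harm estimates `{"a": 0.5, "b": 0.001, "c": 0.002}` and usefulness scores `{"a": 10, "b": 1, "c": 2}`, calling `choose_action(["a", "b", "c"], harm.get, use.get)` filters out `"a"` despite its high usefulness and returns `"c"`.
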
