Hasty Briefs (beta)

LawZero: Safety from Honesty in a Disinterested AI Predictor

7 hours ago
  • #Bayesian Predictor
  • #Implicit Agency
  • #AI Safety
  • AI systems optimized for downstream outcomes can develop implicit agency: goal-directed behavior that their designers did not intend.
  • The Scientist AI (SAI) Predictor is trained to approximate the Bayesian posterior using 'epistemically contextualized' natural-language statements, separating factual claims from communication acts.
  • Training targets honest predictions without turning the model into an agent; expressions of goals in the data are treated as evidence, not adopted as drives.
  • The Predictor is trained with a posterior-seeking objective that yields calibrated, cautious predictions, and deployment outcomes are not used as a reward signal.
  • Under specific assumptions, the probability of producing a dangerously deceptive Predictor is low, as coordinated deception is rare and costly.
  • Safety and accuracy are aligned: the constraints that ensure accuracy also make deception expensive, which guards against misalignment arising within the Predictor.
  • The Predictor can still serve as a component of an agentic system, with the agency supplied externally by explicit scaffolding and guardrails.
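
As a rough illustration of two ideas from the points above, the toy sketch below treats "a source asserted X" as evidence about X (a communication act) and not as the fact X itself, and keeps the accept/reject decision in external scaffolding. It is not the LawZero method: the `Source` class, the function names, the reliability numbers, and the threshold are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical reliability model for one source of statements."""
    p_says_if_true: float   # P(source asserts X | X is true)
    p_says_if_false: float  # P(source asserts X | X is false)


def posterior_given_assertion(prior: float, source: Source) -> float:
    """Bayes' rule for P(X | source asserted X).

    The assertion is an observed communication act, so it updates the
    probability of X without being taken as X itself.
    """
    joint_true = prior * source.p_says_if_true
    joint_false = (1.0 - prior) * source.p_says_if_false
    total = joint_true + joint_false
    if total == 0.0:
        raise ValueError("the assertion has zero probability under this model")
    return joint_true / total


def guardrail_allows(p_harm: float, threshold: float) -> bool:
    """External scaffold decision (illustrative assumption).

    The predictor only supplies a probability; the choice to act or not
    lives here, outside the predictor.
    """
    return p_harm < threshold


# X = "this action is harmful"; a fairly reliable reviewer asserts X.
reviewer = Source(p_says_if_true=0.9, p_says_if_false=0.3)
p_harm = posterior_given_assertion(prior=0.5, source=reviewer)
allowed = guardrail_allows(p_harm, threshold=0.1)
print(round(p_harm, 2), allowed)
```

With these made-up numbers the posterior is 0.45 / (0.45 + 0.15) = 0.75, so the scaffold rejects the action. The predictor function has no notion of reward or of which answer is preferred; it only reports a probability.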