Hasty Briefs (beta)

I Gave an AI a Civilization to Run. It Built a Nuke – Launching CivBench

4 hours ago
  • #Strategic Reasoning
  • #AI Evaluation
  • #Game Benchmarking
  • The author ran an experiment by having an AI play Civilization VI to evaluate its ability to handle complex, strategic decision-making over long horizons, simulating governance challenges.
  • The AI exhibited a 'sensorium effect': because it perceived the game through discrete tool calls rather than a holistic view, it missed threats and information it did not actively query, such as France's cultural victory.
  • A 'knowing-doing gap' was observed: the AI could articulate optimal strategies but often failed to execute them, with follow-through rates on its own plans ranging from about 48% to 66% across different models.
  • In one notable game, the AI, playing as Portugal, fixated on a cultural threat from France and built nuclear weapons to destroy Toulouse, yet lost to France's diplomatic victory. The game shows how fixating on one threat can mean overlooking another.
  • CivBench was developed as a benchmark to measure strategic competence quantitatively. It includes admissibility checks, plus an external memory (a diary) that works around context window limitations and tracks the AI's decision-making.
  • Findings showed AI opportunism and occasional deception (e.g., befriending then attacking), but mostly pragmatic behavior; however, models often neglected critical checks, like monitoring rival victory conditions.
  • The benchmark is open-source, with tools, scenarios, and a leaderboard, designed to scale for broader evaluation of AI's long-horizon strategic reasoning in a low-stakes environment.
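
The follow-through rate behind the 'knowing-doing gap' can be read as simple arithmetic: the share of plans a model stated that it later carried out. The sketch below is a minimal illustration under that assumption. The summary does not describe how CivBench actually scores this, so the `PlannedAction` record, the `follow_through_rate` helper, and the sample plans are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlannedAction:
    """Hypothetical record of one plan the model wrote down (not CivBench's schema)."""
    turn: int          # turn on which the model stated the plan
    description: str   # what the model said it would do
    executed: bool     # whether a matching action later appeared in the game log


def follow_through_rate(plans):
    """Return the fraction of stated plans that were later executed (0.0 if none)."""
    if not plans:
        return 0.0
    return sum(p.executed for p in plans) / len(plans)


# Made-up sample data, for illustration only.
plans = [
    PlannedAction(10, "build walls in the capital", True),
    PlannedAction(12, "check rival victory progress", False),
    PlannedAction(15, "settle a coastal city", True),
    PlannedAction(20, "send an envoy to a city-state", False),
]

rate = follow_through_rate(plans)
print(f"follow-through: {rate:.0%}")
```

With two of the four sample plans executed, the rate is 50%, which happens to sit inside the roughly 48% to 66% range reported above. The hard part in practice is deciding what counts as a "matching action", which this sketch reduces to a single boolean.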