I Gave an AI a Civilization to Run. It Built a Nuke – Launching CivBench

The author had an AI play Civilization VI to evaluate how well it handles complex strategic decision-making over long horizons, using the game as a stand-in for governance challenges.
The AI exhibited a 'sensorium effect': because it perceived the game through discrete tool calls rather than a holistic view, it missed threats and information it did not actively query, such as France's cultural victory.
A 'knowing-doing gap' was observed: the AI could articulate optimal strategies but often failed to execute them, with follow-through rates on its own plans ranging from about 48% to 66% across different models.
In one notable game, the AI, playing as Portugal, fixated on a cultural threat from France and built nuclear weapons to destroy Toulouse, but lost to France's diplomatic victory. The game illustrates fixation on one threat while overlooking another.
CivBench was developed as a benchmark to measure strategic competence quantitatively. It includes admissibility checks and an external memory (a diary) to address context window limitations and track the AI's decision-making.
The models showed opportunism and occasional deception (e.g., befriending a rival and then attacking it), but their behavior was mostly pragmatic. They often neglected critical checks, such as monitoring rivals' victory conditions.
The benchmark is open-source, with tools, scenarios, and a leaderboard, designed to scale for broader evaluation of AI's long-horizon strategic reasoning in a low-stakes environment.
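
To make the 'knowing-doing gap' concrete, here is a minimal sketch of how a follow-through rate could be computed from diary entries. It is not CivBench's actual code: the `DiaryEntry` fields, the `follow_through_rate` helper, and the sample entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DiaryEntry:
    """Hypothetical diary record; not CivBench's real schema."""
    turn: int
    plan: str       # what the model said it would do
    executed: bool  # whether a matching action was later observed


def follow_through_rate(entries: list[DiaryEntry]) -> float:
    """Fraction of stated plans that were actually carried out."""
    if not entries:
        return 0.0
    return sum(e.executed for e in entries) / len(entries)


# Made-up sample data, not results from the article.
diary = [
    DiaryEntry(10, "build a campus", True),
    DiaryEntry(25, "check rival victory progress", False),
    DiaryEntry(40, "settle a coastal city", True),
    DiaryEntry(55, "train an escort unit", False),
]

rate = follow_through_rate(diary)
print(f"follow-through: {rate:.0%}")
```

Under this definition, a rate near 0.5 would mean the model acts on only about half of the plans it writes down, which is the kind of gap the summary describes.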
