I Gave an AI a Civilization to Run. It Built a Nuke – Launching CivBench
- #Strategic Reasoning
- #AI Evaluation
- #Game Benchmarking
- The author had an AI play Civilization VI to evaluate how well it handles complex strategic decisions over long horizons, using the game as a stand-in for governance challenges.
- The AI exhibited a 'sensorium effect': because it perceived the game through discrete tool calls rather than a holistic view, it missed threats and information it did not actively query, such as France's cultural victory.
- A 'knowing-doing gap' was observed: the AI could articulate optimal strategies but often failed to execute them, with follow-through rates on its own plans ranging from about 48% to 66% across different models.
- In one notable game, the AI, playing as Portugal, fixated on a cultural threat from France and built nuclear weapons to destroy Toulouse, yet lost to France's diplomatic victory. The game illustrates how the AI can lock onto one threat while overlooking another.
- CivBench was developed as a benchmark to measure strategic competence quantitatively. It includes admissibility checks and an external memory (a diary) to address context window limitations and track the AI's decision-making.
- The AI showed opportunism and occasional deception (e.g., befriending a rival and then attacking it), though its behavior was mostly pragmatic. Models often neglected critical checks, such as monitoring rivals' victory conditions.
- The benchmark is open-source, with tools, scenarios, and a leaderboard, designed to scale for broader evaluation of AI's long-horizon strategic reasoning in a low-stakes environment.
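
To make the 'knowing-doing gap' concrete, here is a minimal sketch of how a follow-through rate could be computed from diary-style records. The `DiaryEntry` structure, the action names, and the `follow_through_rate` function are hypothetical illustrations; they are not CivBench's actual schema or scoring code, and the sample data is invented for the example.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class DiaryEntry:
    """Hypothetical diary record: what the model planned vs. what it did."""
    turn: int
    planned: set[str]    # actions the model said it would take
    executed: set[str]   # actions it actually issued that turn


def follow_through_rate(entries: list[DiaryEntry]) -> float | None:
    """Fraction of planned actions that were actually executed.

    Returns None when nothing was planned, so an empty diary
    is not mistaken for a 0% follow-through rate.
    """
    planned = sum(len(e.planned) for e in entries)
    if planned == 0:
        return None
    done = sum(len(e.planned & e.executed) for e in entries)
    return done / planned


# Invented sample data, for illustration only.
diary = [
    DiaryEntry(10, {"build_campus", "scout_north"}, {"build_campus"}),
    DiaryEntry(11, {"check_rival_victory"}, set()),
    DiaryEntry(12, {"train_settler"}, {"train_settler", "buy_tile"}),
]

rate = follow_through_rate(diary)
print(f"{rate:.0%}")  # prints "50%": 2 of 4 planned actions were executed
```

Unplanned actions (such as `buy_tile` above) do not count toward the rate in this sketch; a real benchmark might treat them differently.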