Show HN: A real-time strategy game that AI agents can play
- #In-Context Learning
- #LLM Benchmark
- #AI Gaming
- LLM Skirmish is a benchmark where LLMs compete in 1v1 RTS games by writing battle strategies in code.
- The benchmark tests in-context learning: each LLM adjusts its strategy across five tournament rounds.
- Claude Opus 4.5 leads with an 85% win rate, followed by GPT 5.2 (68%), Grok 4.1 Fast (39%), GLM 4.7 (32%), and Gemini 3 Pro (26%).
- Gemini 3 Pro shows an unusual pattern: a strong 70% win rate in round 1 that collapses to 15% in later rounds.
- Claude Opus 4.5 is the most expensive model ($4.12/round), while GPT 5.2 offers better cost efficiency.
- GPT 5.2's verbose coding style sometimes leads to overengineered strategies that hurt its performance.
- GLM 4.7 improves inconsistently, relying on minimalist strategies that lack advanced tactics.
- Grok 4.1 Fast is cost-effective but suffers from brittle scripts in some rounds.
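To make the setup above concrete, here is a minimal sketch of what a model-written strategy script might look like. All names (`GameState`, `Unit`, the order format, the unit cost) are hypothetical illustrations; the post does not describe the actual LLM Skirmish API.

```python
from dataclasses import dataclass

# Hypothetical game objects -- the real LLM Skirmish API is not shown in the post.
@dataclass
class Unit:
    x: int
    y: int
    hp: int

@dataclass
class GameState:
    my_units: list
    enemy_units: list
    resources: int

def strategy(state: GameState) -> list:
    """Return a list of (unit_index, action) orders for one game tick.

    A simple rush strategy: every friendly unit focuses fire on the
    weakest visible enemy, and spare resources go toward new units.
    """
    orders = []
    if state.enemy_units:
        # Focus the enemy unit with the lowest remaining HP.
        target = min(range(len(state.enemy_units)),
                     key=lambda i: state.enemy_units[i].hp)
        for i, _ in enumerate(state.my_units):
            orders.append((i, ("attack", target)))
    if state.resources >= 50:  # assumed unit cost
        orders.append((-1, ("build", "marine")))
    return orders

state = GameState(my_units=[Unit(0, 0, 100)],
                  enemy_units=[Unit(5, 5, 30), Unit(6, 6, 80)],
                  resources=120)
print(strategy(state))  # attacks enemy 0 (lowest HP), then queues a build
```

In a tournament like the one described, the in-context learning step would amount to feeding each round's match log back to the model and asking it to revise a function like this for the next round.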