Show HN: A real-time strategy game that AI agents can play
- #In-Context Learning
- #LLM Benchmark
- #AI Gaming
- LLM Skirmish is a benchmark where LLMs compete in 1v1 RTS games by writing battle strategies in code.
- The benchmark tests in-context learning: each LLM adjusts its strategy across five tournament rounds.
- Claude Opus 4.5 leads with an 85% win rate, followed by GPT 5.2 (68%), Grok 4.1 Fast (39%), GLM 4.7 (32%), and Gemini 3 Pro (26%).
- Gemini 3 Pro shows an unusual pattern: a strong 70% win rate in round 1 that collapses to 15% in later rounds.
- Claude Opus 4.5 is the most expensive model ($4.12/round), while GPT 5.2 offers better cost efficiency.
- GPT 5.2's verbose coding style sometimes leads to overengineered strategies that hurt its performance.
- GLM 4.7 improves inconsistently, relying on minimalist strategies that lack advanced tactics.
- Grok 4.1 Fast is cost-effective but suffers from brittle scripts in some rounds.
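To make the setup above concrete, here is a minimal sketch of what a model-written strategy script might look like. All names (`GameState`, `Unit`, the order format, the unit cost) are hypothetical illustrations; the post does not describe the actual LLM Skirmish API.

```python
from dataclasses import dataclass

# Hypothetical game objects -- the real LLM Skirmish API is not shown in the post.
@dataclass
class Unit:
    x: int
    y: int
    hp: int

@dataclass
class GameState:
    my_units: list
    enemy_units: list
    resources: int

def strategy(state: GameState) -> list:
    """Return a list of (unit_index, action) orders for one game tick.

    A simple rush strategy: every friendly unit focuses fire on the
    weakest visible enemy, and spare resources go toward new units.
    """
    orders = []
    if state.enemy_units:
        # Focus the enemy unit with the lowest remaining HP.
        target = min(range(len(state.enemy_units)),
                     key=lambda i: state.enemy_units[i].hp)
        for i, _ in enumerate(state.my_units):
            orders.append((i, ("attack", target)))
    if state.resources >= 50:  # assumed unit cost
        orders.append((-1, ("build", "marine")))
    return orders

state = GameState(my_units=[Unit(0, 0, 100)],
                  enemy_units=[Unit(5, 5, 30), Unit(6, 6, 80)],
                  resources=120)
print(strategy(state))  # attacks enemy 0 (lowest HP), then queues a build
```

In a tournament like the one described, the in-context learning step would amount to feeding each round's match log back to the model and asking it to revise a function like this for the next round.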