Hasty Briefs

Show HN: A real-time strategy game that AI agents can play

5 hours ago
  • #In-Context Learning
  • #LLM Benchmark
  • #AI Gaming
  • LLM Skirmish is a benchmark where LLMs compete in 1v1 RTS games by writing battle strategies in code.
  • The benchmark tests in-context learning: LLMs revise their strategies across a five-round tournament based on earlier results.
  • Claude Opus 4.5 leads with an 85% win rate, followed by GPT 5.2 (68%), Grok 4.1 Fast (39%), GLM 4.7 (32%), and Gemini 3 Pro (26%).
  • Gemini 3 Pro shows an unusual pattern: its win rate starts strong in round 1 (70%) but falls sharply in later rounds (15%).
  • Claude Opus 4.5 is the most expensive model ($4.12/round), while GPT 5.2 offers better cost efficiency.
  • GPT 5.2's verbose coding style sometimes leads to overengineering, which hurts its performance.
  • GLM 4.7 improves inconsistently, relying on minimalist strategies without advanced tactics.
  • Grok 4.1 Fast is cost-effective but suffers from brittle scripts in some rounds.
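The round-over-round adjustment the benchmark measures can be sketched as a simple tournament loop. This is an illustrative toy, not the benchmark's actual harness: in the real benchmark the models emit strategy code, whereas here a hypothetical `fake_llm` function stands in for the model call and returns a single numeric "aggression" parameter that it nudges based on prior results fed back into its context.

```python
import random

ROUNDS = 5

def fake_llm(history):
    # Stand-in for an LLM call. In-context learning is approximated by
    # feeding the win/loss history back in: the "strategy" gets more
    # aggressive after wins and more cautious after losses.
    aggression = 0.5 + 0.1 * sum(1 if won else -1 for won in history)
    return max(0.0, min(1.0, aggression))

def simulate_match(a_aggression, b_aggression, seed=0):
    # Toy 1v1 resolution: the higher effective score wins, with a
    # little seeded noise in place of a real RTS simulation.
    rng = random.Random(seed)
    score_a = a_aggression + rng.uniform(-0.05, 0.05)
    score_b = b_aggression + rng.uniform(-0.05, 0.05)
    return score_a > score_b

def run_tournament(rounds=ROUNDS):
    history = []          # outcomes shown back to the "model" each round
    results = []
    for r in range(rounds):
        strategy = fake_llm(history)      # model adjusts using past rounds
        baseline = 0.5                    # fixed opponent strategy
        won = simulate_match(strategy, baseline, seed=r)
        history.append(won)
        results.append(won)
    return results

if __name__ == "__main__":
    print(run_tournament())
```

A flat or declining win sequence from a loop like this is what the Gemini 3 Pro result above would look like: a strong first round followed by strategies that fail to improve on feedback.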