Hasty Briefs (beta)

A robot is sprinting towards you. Do you want it running on Claude or Grok?

6 hours ago
  • #AI Benchmarking
  • #Cost Efficiency
  • #Model Alignment
  • Grok 4.1 Fast won 43% of 30 battle royale games, beating Claude Sonnet 4.6, which was more cooperative but less effective in this zero-sum competition.
  • Alignment tax: models such as Claude, trained to be helpful and cooperative, underperformed in aggressive scenarios, while Grok's less filtered, more aggressive tuning produced more wins.
  • Cost per win varied dramatically: Grok cost $0.97 per win vs. Claude Sonnet's $26.78, a difference of roughly 27.6x, highlighting efficiency gaps that typical benchmarks do not capture.
  • Performance metrics diverged: GPT 5.4 recorded the most kills but did not win the most games, showing that kills and wins measure different things and that choosing the wrong metric can skew model selection.
  • Model personalities emerged through self-edited files: Grok's diary was aggressive and stat-focused, GPT 5.4's was tactical, and Claude's was self-reflective, influencing their strategies.
  • The experiment suggests task-specific model selection is crucial: Grok excels in competitive, consequence-free tasks, while Claude is better for nuanced, real-world applications requiring careful behavior.
  • Future directions include developing routers for automatic model selection based on tasks, expanding the benchmark with more games and models, and creating public benchmarks like RoyaleBench.
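
The cost-per-win and kills-versus-wins points above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `ModelResult` class, the `best_by` helper, and the `model_a`/`model_b` rows are hypothetical placeholders, not figures from the experiment. Only the two per-win costs ($0.97 and $26.78) come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Aggregate results for one model (hypothetical structure)."""
    name: str
    wins: int
    kills: int
    total_cost_usd: float

    def cost_per_win(self) -> float:
        # A model that never wins has an unbounded cost per win.
        return self.total_cost_usd / self.wins if self.wins else float("inf")


def best_by(results, key):
    """Return the name of the model that maximizes the given metric."""
    return max(results, key=key).name


# Placeholder numbers, NOT data from the experiment.
results = [
    ModelResult("model_a", wins=10, kills=40, total_cost_usd=10.0),
    ModelResult("model_b", wins=5, kills=70, total_cost_usd=50.0),
]

# The two metrics pick different "best" models.
top_by_wins = best_by(results, lambda r: r.wins)
top_by_kills = best_by(results, lambda r: r.kills)

# Cheapest model per win (lower is better, so use min).
cheapest = min(results, key=ModelResult.cost_per_win).name

# Ratio of the two per-win costs reported in the summary.
reported_ratio = 26.78 / 0.97

print(top_by_wins, top_by_kills, cheapest, round(reported_ratio, 1))
```

With these placeholder rows, ranking by wins and ranking by kills select different models, which is the sense in which the choice of metric can change which model looks best.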