A robot is sprinting towards you. Do you want it running on Claude or Grok?

6 hours ago

Grok 4.1 Fast won 43% of 30 battle royale games, beating Claude Sonnet 4.6, which was more cooperative but less effective in this zero-sum competition.
Alignment tax: models like Claude, trained to be helpful and cooperative, underperformed in aggressive scenarios, while Grok's less filtered, more aggressive tuning led to more wins.
Cost per win varied dramatically: Grok cost $0.97 per win vs. Claude Sonnet's $26.78, a difference of roughly 27x, highlighting efficiency gaps not captured by typical benchmarks.
Performance metrics diverged: GPT 5.4 had the most kills but fewer wins, showing that kills and wins measure different aspects, and the wrong metric can skew model selection.
Model personalities emerged through self-edited files: Grok's diary was aggressive and stat-focused, GPT 5.4's was tactical, and Claude's was self-reflective, and these files influenced each model's strategy.
The experiment suggests task-specific model selection is crucial: Grok excels in competitive, consequence-free tasks, while Claude is better for nuanced, real-world applications requiring careful behavior.
Future directions include developing routers for automatic model selection based on tasks, expanding the benchmark with more games and models, and creating public benchmarks like RoyaleBench.
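
The cost-per-win comparison above can be checked with a little arithmetic. The sketch below uses only the figures reported in this summary (30 games, a 43% win rate for Grok, and the two cost-per-win values). The function name `cost_per_win` and the variable names are illustrative assumptions, not taken from the original experiment, and the "implied spend" is a back-calculation from rounded numbers rather than a reported result.

```python
def cost_per_win(total_cost_usd: float, wins: int) -> float:
    """Total spend divided by games won; infinite if the model never won."""
    if wins == 0:
        return float("inf")
    return total_cost_usd / wins


# Figures as reported in the summary above.
GAMES = 30
GROK_WIN_RATE = 0.43
GROK_COST_PER_WIN = 0.97
CLAUDE_COST_PER_WIN = 26.78

# 43% of 30 games is 12.9, so Grok won about 13 games.
grok_wins = round(GAMES * GROK_WIN_RATE)

# Ratio between the two reported cost-per-win values.
ratio = CLAUDE_COST_PER_WIN / GROK_COST_PER_WIN

# Back-calculated (not reported) total spend for Grok's wins.
implied_grok_spend = grok_wins * GROK_COST_PER_WIN

print(f"Grok wins (approx.): {grok_wins}")
print(f"Cost-per-win ratio: {ratio:.1f}x")
print(f"Implied Grok spend: ${implied_grok_spend:.2f}")
```

Dividing spend by wins rather than by kills is the point of the comparison: a model can lead on kills and still be expensive per win, so the choice of denominator changes which model looks best.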
