A robot is sprinting towards you. Do you want it running on Claude or Grok?
6 hours ago
- #AI Benchmarking
- #Cost Efficiency
- #Model Alignment
- Grok 4.1 Fast won 43% of 30 battle royale games, beating Claude Sonnet 4.6, which was more cooperative but less effective in this zero-sum competition.
- Alignment tax: models like Claude, trained to be helpful and cooperative, underperformed in aggressive scenarios, while Grok's less filtered, more aggressive tuning led to more wins.
- Cost per win varied dramatically: Grok cost $0.97 per win vs. Claude Sonnet's $26.78, roughly a 27x difference (26.78 / 0.97 ≈ 27.6), highlighting efficiency gaps not captured by typical benchmarks.
- Performance metrics diverged: GPT 5.4 had the most kills but not the most wins, showing that kills and wins measure different things and that choosing the wrong metric can skew model selection.
- Model personalities emerged through self-edited files: Grok's diary was aggressive and stat-focused, GPT 5.4's was tactical, and Claude's was self-reflective, and these differences influenced their strategies.
- The experiment suggests task-specific model selection is crucial: Grok excels in competitive, consequence-free tasks, while Claude is better for nuanced, real-world applications requiring careful behavior.
- Future directions include developing routers for automatic model selection based on tasks, expanding the benchmark with more games and models, and creating public benchmarks like RoyaleBench.
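
The cost-per-win arithmetic in the summary can be reproduced with a short sketch. This is only an illustration built from the figures quoted above. The function name `cost_per_win`, the rounding of 43% of 30 games to a whole number of wins, and the implied total spend are assumptions, not values reported by the experiment.

```python
import math


def cost_per_win(total_cost_usd: float, wins: int) -> float:
    """Total spend divided by games won; infinite if the model never won.

    Hypothetical helper for illustration, not part of the benchmark itself.
    """
    return total_cost_usd / wins if wins > 0 else math.inf


# Figures quoted in the summary.
GAMES_PLAYED = 30
GROK_WIN_RATE = 0.43
GROK_COST_PER_WIN_USD = 0.97
CLAUDE_COST_PER_WIN_USD = 26.78

# Assumption: 43% of 30 games is 12.9, so we round to the nearest whole win.
grok_wins = round(GROK_WIN_RATE * GAMES_PLAYED)

# Assumption: implied total spend, back-computed from cost per win and wins.
grok_total_estimate_usd = GROK_COST_PER_WIN_USD * grok_wins

# Ratio of the two reported cost-per-win figures.
cost_ratio = CLAUDE_COST_PER_WIN_USD / GROK_COST_PER_WIN_USD

print(f"Grok wins (inferred): {grok_wins}")
print(f"Grok total spend (estimated): ${grok_total_estimate_usd:.2f}")
print(f"Claude / Grok cost-per-win ratio: {cost_ratio:.1f}x")
```

Dividing spend by wins rather than by games played is what separates this metric from a plain cost comparison: a model that is cheap per game but rarely wins can still end up expensive per win.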