MTG Bench: Testing how well LLMs can play Magic

13 hours ago

A benchmark for LLMs playing Magic: The Gathering was created to test whether capable models can play the game without a rules engine.
The benchmark uses an MCP server for library operations, such as drawing or shuffling cards, but performs no legality checks during simulation.
OpenAI's remote MCP server implementation charges input tokens once per call, while Anthropic's charges them repeatedly, which makes the Anthropic runs less cost-efficient.
LLMs are better at evaluating whether a turn is legal than at playing legal turns themselves, and token usage is high, especially for claude-fable-5 (51,610 tokens on average).
Over-eager tool calling causes failures in Magic simulations, because an irreversible action (e.g., drawing a card) cannot be undone and breaks the simulation.
MTG Auto Deck was built entirely through vibe coding, with no hand-written code, but using the live app is not recommended because it is expensive and slow.
Future potential includes running hundreds of simulations for statistical analysis or automatic deck optimization as LLMs improve.
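
To make the library operations and the over-eager tool-calling failure concrete, here is a minimal Python sketch of what a library tool might look like. It is not the benchmark's actual MCP server: the `Library` class, its `draw` and `shuffle` methods, the `log` field, and the sample decklist are all hypothetical names chosen for illustration. The sketch only assumes what the summary states, namely that the tool moves cards around and performs no legality checks.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Library:
    """Hypothetical library tool: an ordered pile of card names (top of library is last)."""
    cards: list[str]
    rng: random.Random = field(default_factory=random.Random)
    log: list[str] = field(default_factory=list)

    def shuffle(self) -> None:
        # Randomize the card order in place and record the call.
        self.rng.shuffle(self.cards)
        self.log.append("shuffle")

    def draw(self, count: int = 1) -> list[str]:
        # Remove cards from the top. There is deliberately no check that the
        # game rules allow a draw right now, and there is no undo.
        if count > len(self.cards):
            raise ValueError("cannot draw more cards than the library holds")
        drawn = [self.cards.pop() for _ in range(count)]
        self.log.append(f"draw {count}")
        return drawn


# Illustrative decklist (an assumption, not taken from the benchmark).
library = Library(cards=["Forest"] * 4 + ["Llanowar Elves"] * 3, rng=random.Random(7))
library.shuffle()
hand = library.draw(3)
extra = library.draw()  # an over-eager extra call: this card has left the library for good
print(len(hand), len(library.cards), library.log)
```

Because `draw` mutates the pile and offers no way back, a single unnecessary call leaves the simulated game in a state the model cannot repair, which is the failure mode described above. Keeping a call log, as in this sketch, is one way a harness could at least detect such runs afterwards.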
