MTG Bench: Testing how well LLMs can play Magic
14 hours ago
- #Agent Loop Costs
- #Magic: The Gathering Simulation
- #LLM Benchmark
- The benchmark tests whether capable LLMs can play Magic: The Gathering correctly without a rules engine.
- It uses an MCP server for library operations, such as drawing or shuffling cards, but performs no legality checks during the simulation.
- OpenAI's remote MCP server implementation charges for input tokens once per call, while Anthropic's charges for them repeatedly, which affects cost efficiency.
- LLMs are better at evaluating whether a turn was legal than at performing legal turns themselves. Token usage is high, especially for claude-fable-5 (51,610 on average).
- Over-eager tool calling causes Magic simulations to fail, because irreversible actions (e.g., drawing a card) cannot be undone and break the simulation.
- MTG Auto Deck was developed entirely via vibe coding, with no manually written code, but using the live app is not recommended due to high costs and slow speeds.
- Future potential includes running hundreds of simulations for statistical analysis or automatic deck optimization as LLMs improve.
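
To make the library operations and the irreversibility problem from the points above concrete, here is a minimal Python sketch. It is not the benchmark's actual MCP server: the `Library` class, its method names, and the card list are hypothetical, chosen only to illustrate a deck that supports shuffling and drawing with no legality checks.

```python
import random


class Library:
    """Hypothetical stand-in for the deck state a tool server might hold.

    It only tracks card order. It does NOT check whether a draw is legal
    under the rules of Magic; that judgment is left to the caller.
    """

    def __init__(self, cards, seed=None):
        self.cards = list(cards)
        self.drawn = []  # every card that has ever left the library
        self._rng = random.Random(seed)

    def shuffle(self):
        """Randomize the order of the remaining cards in place."""
        self._rng.shuffle(self.cards)

    def draw(self, n=1):
        """Remove and return the top n cards. There is no undo."""
        if n > len(self.cards):
            raise ValueError("not enough cards left in library")
        top, self.cards = self.cards[:n], self.cards[n:]
        self.drawn.extend(top)
        return top


# Assumed example deck, for illustration only.
library = Library(["Forest"] * 3 + ["Llanowar Elves"] * 2, seed=1)
library.shuffle()
hand = library.draw(2)

# An over-eager extra call: the card is gone and the hidden order is revealed.
extra = library.draw(1)
```

Because `draw` mutates state and exposes hidden information, a single unnecessary call leaves the simulated game in a position that cannot be rolled back, which is why eager tool use is so costly in this setting.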