MTG Bench: Testing how well LLMs can play Magic

13 hours ago
  • #Agent Loop Costs
  • #Magic: The Gathering Simulation
  • #LLM Benchmark
  • The benchmark tests whether capable LLMs can play Magic: The Gathering without a rules engine.
  • It uses an MCP server for library operations, such as drawing and shuffling cards, but performs no legality checks during the simulation.
  • OpenAI's remote MCP server implementation charges input tokens once per call, while Anthropic's charges repeatedly, affecting cost efficiency.
  • LLMs are better at evaluating whether a turn is legal than at performing legal turns themselves. Token usage is high, especially for claude-fable-5 (51,610 tokens on average).
  • Over-eager tool calling causes failures in Magic simulations, because irreversible actions (e.g., drawing a card) cannot be undone and break the simulation.
  • MTG Auto Deck was developed entirely through vibe coding, with no manually written code, but using the live app is not recommended because of its high cost and slow speed.
  • Future potential includes running hundreds of simulations for statistical analysis or automatic deck optimization as LLMs improve.
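
To make the library-operations point concrete, here is a minimal sketch of the kind of state such a tool server might manage. The `Library` class and its method names are hypothetical and not taken from the benchmark; the sketch only illustrates that drawing and shuffling are handled by code, that no game-rule legality is checked, and that a draw cannot be undone.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Library:
    """Hypothetical deck state; not the benchmark's actual implementation."""
    cards: list[str]
    hand: list[str] = field(default_factory=list)
    rng: random.Random = field(default_factory=random.Random)

    def shuffle(self) -> None:
        # Randomize the order of the remaining cards in place.
        self.rng.shuffle(self.cards)

    def draw(self, n: int = 1) -> list[str]:
        # Move the top n cards to the hand. There is deliberately no undo
        # and no check that drawing is legal at this point in the turn.
        if n > len(self.cards):
            raise ValueError("not enough cards left in library")
        drawn = [self.cards.pop(0) for _ in range(n)]
        self.hand.extend(drawn)
        return drawn


# Usage: a seven-card toy deck with a seeded generator for repeatability.
library = Library(cards=["Forest"] * 4 + ["Llanowar Elves"] * 3,
                  rng=random.Random(0))
library.shuffle()
opening = library.draw(3)
```

Because `draw` mutates the state immediately, a model that calls it one time too many has already changed the game, which is why over-eager tool calls are hard to recover from.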
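
The cost difference between the two billing behaviours can be illustrated with a toy model. This is one possible reading of the distinction, not either provider's actual pricing logic: the function names, the token counts, and the assumption that each tool call adds a fixed number of tokens to the context are all hypothetical.

```python
def input_tokens_billed_once(base: int, per_call: int, calls: int) -> int:
    """Assumed model: the starting context and each tool result are billed once."""
    return base + per_call * calls


def input_tokens_rebilled(base: int, per_call: int, calls: int) -> int:
    """Assumed model: every tool call re-bills the whole context accumulated so far."""
    total = 0
    context = base
    for _ in range(calls):
        total += context      # the full context is billed again for this call
        context += per_call   # the tool result grows the context for the next call
    return total


# Illustrative numbers only: a 2,000-token context, 500 tokens per tool result, 10 calls.
once = input_tokens_billed_once(2000, 500, 10)    # 7000
rebilled = input_tokens_rebilled(2000, 500, 10)   # 42500
```

Under these assumptions the re-billed total grows roughly with the square of the number of tool calls, so long agent loops are where the difference matters most.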