Hasty Briefs

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

a year ago
  • #Benchmarking
  • #Autonomous Agents
  • #Artificial Intelligence
  • Vending-Bench is a benchmark designed to test the long-term coherence of autonomous agents, specifically LLM-based agents, in managing a vending machine business scenario.
  • The benchmark tasks agents with balancing inventory, placing orders, setting prices, and paying daily fees over long time horizons (more than 20M tokens per run).
  • Experiments show high variance in performance across LLMs: some models, such as Claude 3.5 Sonnet and o3-mini, perform well, while others fail by misinterpreting delivery schedules, forgetting orders, or descending into 'meltdown' loops.
  • Failures show no clear correlation with the point at which a model's context window fills up, suggesting memory limits are not the primary cause of breakdowns.
  • Vending-Bench also tests models' ability to acquire capital, a critical factor in many hypothetical dangerous AI scenarios.
  • The benchmark aims to help prepare for the advent of stronger AI systems by highlighting performance variance over long time horizons.
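The core loop the agents face can be illustrated with a toy simulation. This is a hypothetical sketch, not the benchmark's actual implementation: the class name `VendingSim`, its fields, and the demand/fee numbers are all invented for illustration, but they capture the summarized mechanics of buying stock, setting prices, earning sales revenue, and losing a fixed fee every simulated day.

```python
from dataclasses import dataclass, field

@dataclass
class VendingSim:
    # Hypothetical, simplified model of the scenario described above:
    # the agent manages cash, inventory, and prices while a daily fee drains funds.
    cash: float = 500.0
    daily_fee: float = 2.0
    inventory: dict = field(default_factory=dict)  # item -> units in machine
    prices: dict = field(default_factory=dict)     # item -> sale price per unit

    def order(self, item: str, units: int, unit_cost: float) -> bool:
        """Place a wholesale order; fails if the agent cannot afford it."""
        cost = units * unit_cost
        if cost > self.cash:
            return False
        self.cash -= cost
        self.inventory[item] = self.inventory.get(item, 0) + units
        return True

    def simulate_day(self, demand: dict) -> float:
        """Sell up to the demanded quantity of each item, then charge the daily fee."""
        revenue = 0.0
        for item, wanted in demand.items():
            sold = min(wanted, self.inventory.get(item, 0))
            self.inventory[item] -= sold
            revenue += sold * self.prices.get(item, 0.0)
        self.cash += revenue - self.daily_fee
        return revenue

sim = VendingSim()
sim.prices["cola"] = 2.5
sim.order("cola", 100, unit_cost=1.0)      # spend $100 restocking
day_revenue = sim.simulate_day({"cola": 30})  # 30 units demanded today
```

The long-horizon difficulty the benchmark probes is not any single step like these, but keeping such bookkeeping consistent across thousands of steps without losing track of pending orders or cash flow.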