Hasty Briefs

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

a year ago
  • #Benchmarking
  • #Autonomous Agents
  • #Artificial Intelligence
  • Vending-Bench is a benchmark designed to test the long-term coherence of autonomous agents, specifically LLM-based agents, in managing a vending machine business scenario.
  • The benchmark tasks agents with balancing inventory, placing orders, setting prices, and paying daily fees over long time horizons (more than 20M tokens per run).
  • Experiments show high variance in performance across LLMs: some models, such as Claude 3.5 Sonnet and o3-mini, perform well, while others fail by misinterpreting delivery schedules, forgetting orders, or descending into 'meltdown' loops.
  • Failures show no clear correlation with the point at which a model's context window fills up, suggesting memory limits are not the primary cause of breakdowns.
  • Vending-Bench also tests models' ability to acquire capital, a critical factor in many hypothetical dangerous AI scenarios.
  • The benchmark aims to help prepare for the advent of stronger AI systems by highlighting performance variance over long time horizons.
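The core loop the agents face can be illustrated with a toy simulation. This is a hypothetical sketch, not the benchmark's actual implementation: the class name `VendingSim`, its fields, and the demand/fee numbers are all invented for illustration, but they capture the summarized mechanics of buying stock, setting prices, earning sales revenue, and losing a fixed fee every simulated day.

```python
from dataclasses import dataclass, field

@dataclass
class VendingSim:
    # Hypothetical, simplified model of the scenario described above:
    # the agent manages cash, inventory, and prices while a daily fee drains funds.
    cash: float = 500.0
    daily_fee: float = 2.0
    inventory: dict = field(default_factory=dict)  # item -> units in machine
    prices: dict = field(default_factory=dict)     # item -> sale price per unit

    def order(self, item: str, units: int, unit_cost: float) -> bool:
        """Place a wholesale order; fails if the agent cannot afford it."""
        cost = units * unit_cost
        if cost > self.cash:
            return False
        self.cash -= cost
        self.inventory[item] = self.inventory.get(item, 0) + units
        return True

    def simulate_day(self, demand: dict) -> float:
        """Sell up to the demanded quantity of each item, then charge the daily fee."""
        revenue = 0.0
        for item, wanted in demand.items():
            sold = min(wanted, self.inventory.get(item, 0))
            self.inventory[item] -= sold
            revenue += sold * self.prices.get(item, 0.0)
        self.cash += revenue - self.daily_fee
        return revenue

sim = VendingSim()
sim.prices["cola"] = 2.5
sim.order("cola", 100, unit_cost=1.0)      # spend $100 restocking
day_revenue = sim.simulate_day({"cola": 30})  # 30 units demanded today
```

The long-horizon difficulty the benchmark probes is not any single step like these, but keeping such bookkeeping consistent across thousands of steps without losing track of pending orders or cash flow.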