Hasty Briefs


CMU TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

a year ago
  • #AI Agents
  • #Work Automation
  • #Benchmarking
  • TheAgentCompany is introduced as an extensible benchmark for evaluating AI agents on real-world professional tasks.
  • AI agents are tested in a simulated environment modeled on a small software company, performing tasks such as web browsing, coding, and communicating with coworkers.
  • Baseline agents powered by both closed API-based and open-weights language models (LMs) are evaluated.
  • The most competitive agent autonomously completes 24% of tasks, suggesting that simpler workplace tasks may already be automatable.
  • More complex, long-horizon tasks remain beyond the capabilities of current AI systems.
  • The study highlights implications for industry adoption of AI and economic policy regarding labor market effects.
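To make the headline metric concrete, the sketch below shows one simple way a benchmark harness might compute an autonomous task-completion rate like the 24% cited above. This is an illustrative sketch only: the task names, results, and `TaskResult`/`completion_rate` helpers are hypothetical and not taken from TheAgentCompany's actual scoring code (which also awards partial credit via checkpoints).

```python
# Hypothetical sketch of benchmark scoring; names and results are illustrative,
# not drawn from TheAgentCompany itself.
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    completed: bool  # did the agent finish the task fully autonomously?


def completion_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent completed autonomously."""
    if not results:
        return 0.0
    return sum(r.completed for r in results) / len(results)


results = [
    TaskResult("fix-ci-pipeline", True),
    TaskResult("write-quarterly-report", False),
    TaskResult("answer-coworker-question", True),
    TaskResult("refactor-auth-module", False),
]
print(f"Autonomous completion rate: {completion_rate(results):.0%}")
```

A real harness would replace the boolean with per-checkpoint scores, but the aggregate "share of tasks fully completed" is the figure the summary above reports.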