Hasty Briefs


CMU TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

a year ago
  • #AI Agents
  • #Work Automation
  • #Benchmarking
  • TheAgentCompany is introduced as an extensible benchmark for evaluating AI agents on real-world professional tasks.
  • AI agents are tested in a simulated environment modeled on a small software company, performing tasks such as web browsing, coding, and communicating with coworkers.
  • Baseline agents powered by both closed API-based and open-weights language models (LMs) are evaluated.
  • The most competitive agent autonomously completes 24% of tasks, suggesting that simpler workplace tasks may already be automatable.
  • More complex, long-horizon tasks remain beyond the capabilities of current AI systems.
  • The study highlights implications for industry adoption of AI and economic policy regarding labor market effects.
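To make the headline metric concrete, the sketch below shows one simple way a benchmark harness might compute an autonomous task-completion rate like the 24% cited above. This is an illustrative sketch only: the task names, results, and `TaskResult`/`completion_rate` helpers are hypothetical and not taken from TheAgentCompany's actual scoring code (which also awards partial credit via checkpoints).

```python
# Hypothetical sketch of benchmark scoring; names and results are illustrative,
# not drawn from TheAgentCompany itself.
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    completed: bool  # did the agent finish the task fully autonomously?


def completion_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent completed autonomously."""
    if not results:
        return 0.0
    return sum(r.completed for r in results) / len(results)


results = [
    TaskResult("fix-ci-pipeline", True),
    TaskResult("write-quarterly-report", False),
    TaskResult("answer-coworker-question", True),
    TaskResult("refactor-auth-module", False),
]
print(f"Autonomous completion rate: {completion_rate(results):.0%}")
```

A real harness would replace the boolean with per-checkpoint scores, but the aggregate "share of tasks fully completed" is the figure the summary above reports.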