CMU TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
- #AI Agents
- #Work Automation
- #Benchmarking
- TheAgentCompany is introduced as an extensible benchmark for evaluating AI agents on real-world professional tasks.
- AI agents are tested in an environment simulating a small software company, performing tasks such as browsing the web, writing code, and communicating with simulated coworkers.
- Baseline agents powered by both closed API-based and open-weights language models (LMs) are evaluated.
- The most competitive agent autonomously completes 24% of tasks, suggesting that simpler, routine tasks are already amenable to automation.
- More complex, long-horizon tasks remain beyond the capabilities of current AI systems.
- The findings carry implications for industry adoption of AI agents and for economic policy concerning labor market effects.