ScienceBoard: Evaluating Agents in Realistic Scientific Workflows
a year ago
- #Scientific Workflows
- #Autonomous Agents
- #Artificial Intelligence
- Large Language Models (LLMs) are expanding beyond Natural Language Processing, aiding interdisciplinary research.
- LLM-based agents, especially those that operate a computer directly, are beginning to automate scientific workflows by interacting with operating systems.
- ScienceBoard is introduced with two main contributions: a realistic multi-domain environment for autonomous agents and a benchmark of 169 real-world scientific tasks.
- The benchmark spans domains like biochemistry, astronomy, and geoinformatics, validated for real-world applicability.
- Evaluations show that agents built on current models (e.g., GPT-4o, Claude 3.7) achieve only a 15% success rate on complex workflows.
- Insights from the study highlight limitations and design principles for future scientific discovery agents.
- Resources including code, environment, and benchmark are made available for further development.
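
To put the headline number in perspective, a 15% success rate on a 169-task benchmark corresponds to roughly 25 solved tasks. The sketch below shows one way such a rate could be tallied overall and per domain. The function name `success_rates`, the `(domain, succeeded)` record format, and the toy outcomes are all assumptions made for illustration; they are not ScienceBoard's evaluation code or its results.

```python
from collections import defaultdict


def success_rates(results):
    """Tally success rates from (domain, succeeded) pairs.

    Hypothetical helper: returns (overall_rate, {domain: rate}).
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for domain, ok in results:
        totals[domain] += 1
        passed[domain] += int(ok)
    n = sum(totals.values())
    overall = sum(passed.values()) / n if n else 0.0
    per_domain = {d: passed[d] / totals[d] for d in totals}
    return overall, per_domain


# Toy outcomes invented for this example, NOT data from the benchmark.
toy = [
    ("biochemistry", True),
    ("biochemistry", False),
    ("astronomy", False),
    ("geoinformatics", False),
]
overall, per_domain = success_rates(toy)

# Rough scale of the reported figure: 15% of 169 tasks.
approx_solved = round(0.15 * 169)
```

Breaking the rate down by domain, as `per_domain` does here, is one simple way to see whether failures cluster in particular kinds of scientific software.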