ScienceBoard: Evaluating Agents in Realistic Scientific Workflows

a year ago

Large Language Models (LLMs) are expanding beyond natural language processing and now aid interdisciplinary research.
LLM-based agents, especially those that operate a computer directly, are automating scientific workflows by interacting with operating systems.
ScienceBoard makes two main contributions: a realistic multi-domain environment for autonomous agents and a benchmark of 169 real-world scientific tasks.
The benchmark spans domains like biochemistry, astronomy, and geoinformatics, validated for real-world applicability.
Evaluations show current agents (e.g., GPT-4o, Claude 3.7) achieve only a 15% success rate in complex workflows.
The study's findings highlight the limitations of current agents and suggest design principles for future scientific discovery agents.
The code, environment, and benchmark are made available for further development.
