ScienceBoard: Evaluating Agents in Realistic Scientific Workflows

a year ago
  • #Scientific Workflows
  • #Autonomous Agents
  • #Artificial Intelligence
  • Large Language Models (LLMs) are expanding beyond Natural Language Processing, aiding interdisciplinary research.
  • LLM-based agents, especially computer-using ones, are automating scientific workflows by interacting with operating systems.
  • ScienceBoard is introduced with two main contributions: a realistic multi-domain environment for autonomous agents and a benchmark of 169 real-world scientific tasks.
  • The benchmark spans domains like biochemistry, astronomy, and geoinformatics, validated for real-world applicability.
  • Evaluations show that current agents (e.g., those built on GPT-4o or Claude 3.7) achieve only a 15% overall success rate on complex workflows.
  • Insights from the study highlight limitations and design principles for future scientific discovery agents.
  • Resources including code, environment, and benchmark are made available for further development.
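
The success-rate figure above is a fraction of tasks completed, and a minimal sketch of how such a number could be computed may help make it concrete. Everything here is an assumption for illustration: the `TaskResult` record, the `success_rates` helper, and the sample outcomes are hypothetical and are not taken from the ScienceBoard code or its results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one agent attempt at one benchmark task."""
    domain: str
    task_id: str
    success: bool


def success_rates(results):
    """Return (overall_rate, per_domain_rates) as fractions in [0, 1]."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        totals[r.domain] += 1
        passed[r.domain] += int(r.success)
    per_domain = {d: passed[d] / totals[d] for d in totals}
    # Guard against an empty result list to avoid division by zero.
    overall = sum(passed.values()) / len(results) if results else 0.0
    return overall, per_domain


# Made-up outcomes for illustration only, not results from the benchmark.
demo = [
    TaskResult("biochemistry", "bio-1", True),
    TaskResult("biochemistry", "bio-2", False),
    TaskResult("astronomy", "astro-1", False),
    TaskResult("geoinformatics", "geo-1", False),
]

overall, per_domain = success_rates(demo)
print(f"overall: {overall:.0%}")
for domain, rate in sorted(per_domain.items()):
    print(f"{domain}: {rate:.0%}")
```

Reporting a per-domain breakdown alongside the overall rate is a design choice of this sketch; it shows whether a low aggregate comes from uniform weakness or from a few hard domains.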