Hasty Briefs (beta)

Chemical knowledge and reasoning of large language models vs. chemist expertise

a year ago
  • #LLMs
  • #Chemistry
  • #Benchmarking
  • Large language models (LLMs) demonstrate impressive capabilities in processing human language and performing tasks beyond their explicit training.
  • ChemBench is introduced as an automated framework to evaluate the chemical knowledge and reasoning abilities of LLMs against human chemists.
  • The study curated over 2,700 question-answer pairs and found that leading LLMs outperformed human chemists on average, though they struggled with some basic tasks and gave overconfident predictions.
  • LLMs show potential in chemistry applications, such as predicting molecular properties, optimizing reactions, and generating materials, but concerns about dual-use risks (e.g., chemical weapon design) persist.
  • LLM performance varies across chemical subfields: models excel in general chemistry but struggle with topics such as toxicity and safety, and with analytical chemistry.
  • Models exhibit limitations in reasoning about molecular structures and estimating their own confidence, highlighting the need for improved human-model interaction frameworks.
  • The findings suggest a need to rethink chemistry education, emphasizing critical reasoning over rote memorization, given LLMs' capabilities.
  • ChemBench provides a nuanced understanding of LLMs' chemical capabilities, serving as a benchmark for future improvements in safety and usefulness.
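A ChemBench-style evaluation can be pictured as a loop that scores model answers against curated question-answer pairs, broken down by subfield, and compares the model's stated confidence with its actual accuracy. The sketch below is purely illustrative (the `QAPair` type, `evaluate` function, and toy data are assumptions, not the actual ChemBench API):

```python
# Hypothetical sketch of a benchmark loop in the spirit of ChemBench:
# score exact-match answers, aggregate per chemistry subfield, and
# measure overconfidence (mean stated confidence minus accuracy).
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str   # reference answer, e.g. a multiple-choice letter
    topic: str    # chemistry subfield, e.g. "general" or "toxicity"

def evaluate(pairs, model):
    """Return overall accuracy, per-topic accuracy, and an overconfidence gap.

    `model` is any callable mapping a question string to a
    (answer, confidence) tuple, with confidence in [0, 1].
    """
    per_topic = {}
    correct, confidences = 0, []
    for p in pairs:
        answer, confidence = model(p.question)
        hit = answer.strip().lower() == p.answer.strip().lower()
        correct += hit
        confidences.append(confidence)
        bucket = per_topic.setdefault(p.topic, [0, 0])  # [hits, total]
        bucket[0] += hit
        bucket[1] += 1
    accuracy = correct / len(pairs)
    topic_acc = {t: hits / total for t, (hits, total) in per_topic.items()}
    # Positive gap = model claims more confidence than its accuracy supports.
    overconfidence = sum(confidences) / len(confidences) - accuracy
    return accuracy, topic_acc, overconfidence

# Toy "model": always answers "B" with confidence 0.9.
pairs = [
    QAPair("Which gas supports combustion? A) He B) O2", "B", "general"),
    QAPair("Which compound is more toxic? A) ... B) ...", "A", "toxicity"),
]
acc, by_topic, over = evaluate(pairs, lambda q: ("B", 0.9))
print(acc, by_topic, round(over, 2))  # → 0.5 {'general': 1.0, 'toxicity': 0.0} 0.4
```

The per-topic breakdown mirrors the paper's observation that aggregate scores hide subfield weaknesses, and the positive overconfidence gap in the toy run mirrors the reported miscalibration of leading models.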