Hasty Briefs (beta)

Chemical knowledge and reasoning of large language models vs. chemist expertise

a year ago
  • #LLMs
  • #Chemistry
  • #Benchmarking
  • Large language models (LLMs) demonstrate impressive capabilities in processing human language and performing tasks beyond their explicit training.
  • ChemBench is introduced as an automated framework to evaluate the chemical knowledge and reasoning abilities of LLMs against human chemists.
  • The study curated over 2,700 question-answer pairs and found that leading LLMs outperformed human chemists on average, though they struggled with some basic tasks and gave overconfident predictions.
  • LLMs show potential in chemistry applications, such as predicting molecular properties, optimizing reactions, and generating materials, but concerns about dual-use risks (e.g., chemical weapon design) persist.
  • LLM performance varies across chemical subfields: models excel in general chemistry but struggle with topics such as toxicity and safety, and with analytical chemistry.
  • Models exhibit limitations in reasoning about molecular structures and estimating their own confidence, highlighting the need for improved human-model interaction frameworks.
  • The findings suggest a need to rethink chemistry education, emphasizing critical reasoning over rote memorization, given LLMs' capabilities.
  • ChemBench provides a nuanced understanding of LLMs' chemical capabilities, serving as a benchmark for future improvements in safety and usefulness.
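A ChemBench-style evaluation can be pictured as a loop that scores model answers against curated question-answer pairs, broken down by subfield, and compares the model's stated confidence with its actual accuracy. The sketch below is purely illustrative (the `QAPair` type, `evaluate` function, and toy data are assumptions, not the actual ChemBench API):

```python
# Hypothetical sketch of a benchmark loop in the spirit of ChemBench:
# score exact-match answers, aggregate per chemistry subfield, and
# measure overconfidence (mean stated confidence minus accuracy).
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str   # reference answer, e.g. a multiple-choice letter
    topic: str    # chemistry subfield, e.g. "general" or "toxicity"

def evaluate(pairs, model):
    """Return overall accuracy, per-topic accuracy, and an overconfidence gap.

    `model` is any callable mapping a question string to a
    (answer, confidence) tuple, with confidence in [0, 1].
    """
    per_topic = {}
    correct, confidences = 0, []
    for p in pairs:
        answer, confidence = model(p.question)
        hit = answer.strip().lower() == p.answer.strip().lower()
        correct += hit
        confidences.append(confidence)
        bucket = per_topic.setdefault(p.topic, [0, 0])  # [hits, total]
        bucket[0] += hit
        bucket[1] += 1
    accuracy = correct / len(pairs)
    topic_acc = {t: hits / total for t, (hits, total) in per_topic.items()}
    # Positive gap = model claims more confidence than its accuracy supports.
    overconfidence = sum(confidences) / len(confidences) - accuracy
    return accuracy, topic_acc, overconfidence

# Toy "model": always answers "B" with confidence 0.9.
pairs = [
    QAPair("Which gas supports combustion? A) He B) O2", "B", "general"),
    QAPair("Which compound is more toxic? A) ... B) ...", "A", "toxicity"),
]
acc, by_topic, over = evaluate(pairs, lambda q: ("B", 0.9))
print(acc, by_topic, round(over, 2))  # → 0.5 {'general': 1.0, 'toxicity': 0.0} 0.4
```

The per-topic breakdown mirrors the paper's observation that aggregate scores hide subfield weaknesses, and the positive overconfidence gap in the toy run mirrors the reported miscalibration of leading models.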