General-purpose large language models outperform specialized clinical AI tools on medical benchmarks - PubMed

2 hours ago
  • #large language models
  • #clinical AI tools
  • #medical benchmarks
  • General-purpose large language models (LLMs) such as GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 outperformed specialized clinical AI tools (OpenEvidence and UpToDate Expert AI) across a three-stage evaluation.
  • The three stages comprised 500 MedQA questions testing medical knowledge, 500 HealthBench items measuring alignment with clinicians, and a real clinical queries (RCQ) benchmark of 100 de-identified physician queries.
  • On the RCQ benchmark, the clinical AI tools performed comparably to Google Search's auto-enabled AI Overview. Twelve US clinicians conducted the blinded reviews for this benchmark, producing 1,800 model-question annotations.
  • The findings underscore the need for independent, real-world evaluation of AI tools before they are integrated into clinical settings to ensure effectiveness and safety.