General-purpose large language models outperform specialized clinical AI tools on medical benchmarks - PubMed


General-purpose large language models (LLMs) such as GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 outperformed two specialized clinical AI tools, OpenEvidence and UpToDate Expert AI, across three stages of evaluation.
The evaluation included 500 MedQA questions testing medical knowledge, 500 HealthBench items measuring alignment with clinicians, and a real clinical queries (RCQ) benchmark with 100 de-identified physician queries.
On the RCQ benchmark, the clinical AI tools performed comparably to Google Search's auto-enabled AI Overview. Twelve US clinicians conducted blinded reviews, producing 1,800 model-question annotations.
The findings underscore the need for independent, real-world evaluation of AI tools before they are integrated into clinical settings to ensure effectiveness and safety.
