General-purpose large language models outperform specialized clinical AI tools on medical benchmarks - PubMed
- #large language models
- #clinical AI tools
- #medical benchmarks
- General-purpose large language models (LLMs) such as GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 outperform specialized clinical AI tools (OpenEvidence and UpToDate Expert AI) across three stages of evaluation.
- The evaluation included 500 MedQA questions testing medical knowledge, 500 HealthBench items measuring alignment with clinicians, and a real clinical queries (RCQ) benchmark with 100 de-identified physician queries.
- On the RCQ benchmark, 12 US clinicians conducted blinded reviews that produced 1,800 model-question annotations; the clinical AI tools performed comparably to the auto-enabled Google Search AI Overview.
- The findings underscore the need for independent, real-world evaluation of AI tools before they are integrated into clinical settings to ensure effectiveness and safety.
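
The annotation count above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the RCQ benchmark compared the six systems named in this summary on all 100 queries, which the summary does not state outright. Under that assumption, 1,800 annotations would correspond to three independent reviews per model-question pair. The function name `reviews_per_pair` is hypothetical.

```python
# Assumption: the six systems named in the summary were each evaluated
# on all 100 RCQ queries. The summary does not confirm this explicitly.
SYSTEMS = [
    "GPT-5.2",
    "Gemini 3.1 Pro",
    "Claude Opus 4.6",
    "OpenEvidence",
    "UpToDate Expert AI",
    "Google Search AI Overview",
]
N_QUERIES = 100            # de-identified physician queries (from the summary)
TOTAL_ANNOTATIONS = 1800   # model-question annotations (from the summary)


def reviews_per_pair(total_annotations: int, n_systems: int, n_queries: int) -> float:
    """Average number of clinician reviews per (model, question) pair."""
    pairs = n_systems * n_queries
    return total_annotations / pairs


if __name__ == "__main__":
    pairs = len(SYSTEMS) * N_QUERIES
    print(f"model-question pairs: {pairs}")
    print(f"reviews per pair: {reviews_per_pair(TOTAL_ANNOTATIONS, len(SYSTEMS), N_QUERIES)}")
```

If the actual study used a different set of systems or did not have every system answer every query, the per-pair figure would change accordingly.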