General-purpose large language models outperform specialized clinical AI tools

4 hours ago

Clinical AI tools such as OpenEvidence and UpToDate Expert AI underperform frontier general-purpose large language models (LLMs) on medical benchmarks.
General-purpose LLMs (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) outperformed clinical AI tools across MedQA, HealthBench, and real clinical queries (RCQ).
In the RCQ benchmark, frontier LLMs formed a higher performance tier, while clinical AI tools performed similarly to Google Search AI Overview.
Clinical AI tools, particularly OpenEvidence, received lower clarity scores, and UpToDate Expert AI had a higher refusal rate on queries.
No significant differences were found among models in producing harmful content or hallucinations in the RCQ evaluation.
The study highlights the need for independent real-world evaluation before AI tools are integrated into clinical settings.
Limitations include potential bias from industry-created benchmarks and differences in API versus browser-based querying.
Future evaluations should consider response latency, citation quality, and hospital-specific adaptations for clinical AI tools.
