General-purpose large language models outperform specialized clinical AI tools
6 hours ago
- #Large Language Models
- #Clinical Evaluation
- #AI in Healthcare
- Clinical AI tools such as OpenEvidence and UpToDate Expert AI underperform frontier general-purpose large language models (LLMs) on medical benchmarks.
- General-purpose LLMs (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) outperformed clinical AI tools across MedQA, HealthBench, and real clinical queries (RCQ).
- In the RCQ benchmark, frontier LLMs formed a higher performance tier, while clinical AI tools performed similarly to Google Search AI Overview.
- Clinical AI tools, particularly OpenEvidence, received lower clarity scores, and UpToDate Expert AI had a higher refusal rate on queries.
- In the RCQ evaluation, no significant differences were found among models in harmful content or hallucinations.
- The study highlights the need for independent real-world evaluation before AI tools are integrated into clinical settings.
- Limitations include potential bias from industry-created benchmarks and differences between API-based and browser-based querying.
- Future evaluations should consider response latency, citation quality, and hospital-specific adaptations for clinical AI tools.