Hasty Briefs (beta)

General-purpose large language models outperform specialized clinical AI tools

4 hours ago
  • #Large Language Models
  • #Clinical Evaluation
  • #AI in Healthcare
  • Clinical AI tools, like OpenEvidence and UpToDate Expert AI, underperform frontier general-purpose large language models (LLMs) on medical benchmarks.
  • General-purpose LLMs (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) outperformed clinical AI tools across MedQA, HealthBench, and real clinical queries (RCQ).
  • In the RCQ benchmark, frontier LLMs formed a higher performance tier, while clinical AI tools performed similarly to Google Search AI Overview.
  • Clinical AI tools, particularly OpenEvidence, received lower clarity scores, and UpToDate Expert AI refused queries at a higher rate.
  • The RCQ evaluation found no significant differences among models in rates of harmful content or hallucination.
  • The study highlights the need for independent real-world evaluation before AI tools are integrated into clinical settings.
  • Limitations include potential bias from industry-created benchmarks and differences between API-based and browser-based querying.
  • Future evaluations should consider response latency, citation quality, and hospital-specific adaptations for clinical AI tools.
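
For readers who want to run a comparison of this kind themselves, here is a minimal sketch of how refusal rate and mean clarity could be tallied per system from graded responses. The record layout, the 1-5 clarity scale, the system names, and the `summarize` helper are all assumptions made for illustration; none of them come from the study, and the sample records are invented placeholders rather than reported results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical graded responses: (system, refused, clarity score on a 1-5 scale).
# These records are invented placeholders, NOT results from the study.
records = [
    ("system_a", False, 5),
    ("system_a", False, 4),
    ("system_b", True, None),   # a refusal has no clarity score
    ("system_b", False, 3),
]

def summarize(records):
    """Return per-system refusal rate and mean clarity over answered queries."""
    grouped = defaultdict(list)
    for system, refused, clarity in records:
        grouped[system].append((refused, clarity))

    summary = {}
    for system, rows in grouped.items():
        # Clarity is averaged only over queries the system actually answered.
        answered = [clarity for refused, clarity in rows if not refused]
        summary[system] = {
            "refusal_rate": sum(1 for refused, _ in rows if refused) / len(rows),
            "mean_clarity": mean(answered) if answered else None,
        }
    return summary

summary = summarize(records)
for system, stats in summary.items():
    print(system, stats)
```

Keeping refusals out of the clarity average is a design choice: it prevents a system that declines often from being penalized twice, once in each metric. A real evaluation would also need a significance test before ranking systems into tiers.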