RareArena: a comprehensive benchmark dataset unveiling the potential of large language models in rare disease diagnosis - PubMed

2 months ago

RareArena is a benchmark dataset designed to evaluate large language models (LLMs) in rare disease diagnosis.
The dataset addresses gaps in existing evaluations by offering large sample sizes, broad disease coverage, and clinical relevance.
Two tasks were created: Rare Disease Screening (RDS) with 49,760 cases and Rare Disease Confirmation (RDC) with 22,901 cases.
Human evaluation confirmed the dataset's quality on three criteria: leakage, fidelity, and complexity.
Ten state-of-the-art LLMs were benchmarked, with GPT-4o achieving the best performance in both RDS and RDC tasks.
GPT-4o performed better on genetically inherited diseases and excelled in systemic or rheumatologic diseases.
RareArena is the largest rare disease diagnostic benchmark to date, supporting improved global care for rare diseases.
