RareArena: a comprehensive benchmark dataset unveiling the potential of large language models in rare disease diagnosis - PubMed
- #rare diseases
- #medical diagnosis
- #large language models
- RareArena is a benchmark dataset designed to evaluate large language models (LLMs) in rare disease diagnosis.
- The dataset addresses gaps in existing evaluations by offering large sample sizes, broad disease coverage, and clinical relevance.
- Two tasks were created: Rare Disease Screening (RDS) with 49,760 cases and Rare Disease Confirmation (RDC) with 22,901 cases.
- Human evaluations confirmed the dataset's high quality in terms of leakage, fidelity, and complexity.
- Ten state-of-the-art LLMs were benchmarked, with GPT-4o achieving the best performance in both RDS and RDC tasks.
- GPT-4o performed better on genetically inherited diseases and excelled in systemic or rheumatologic diseases.
- RareArena is the largest rare disease diagnostic benchmark to date, supporting improved global care for rare diseases.
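
The summary above does not say how the models were scored. As a rough illustration only, a diagnostic benchmark of this kind could be scored with a top-k accuracy: the fraction of cases whose reference diagnosis appears among a model's first k ranked guesses. The metric choice, the function name `top_k_accuracy`, the exact-string matching, and the toy cases below are all assumptions made for this sketch. None of them is taken from RareArena itself.

```python
def top_k_accuracy(cases, k=5):
    """Return the fraction of cases whose reference diagnosis is in the top k predictions.

    `cases` is a list of (reference_diagnosis, ranked_predictions) pairs.
    Matching is a case-insensitive exact string comparison, which is a
    simplification: a real evaluation would need to handle synonyms and
    disease subtypes.
    """
    if not cases:
        return 0.0
    hits = 0
    for reference, ranked_predictions in cases:
        # Normalize the first k predictions before comparing.
        top_k = [p.strip().lower() for p in ranked_predictions[:k]]
        if reference.strip().lower() in top_k:
            hits += 1
    return hits / len(cases)


# Hypothetical toy cases, invented for illustration (not benchmark data).
toy_cases = [
    ("Fabry disease", ["Fabry disease", "Gaucher disease"]),
    ("Marfan syndrome", ["Ehlers-Danlos syndrome", "Marfan syndrome"]),
    ("Wilson disease", ["Hemochromatosis", "Autoimmune hepatitis"]),
]

top1 = top_k_accuracy(toy_cases, k=1)
top2 = top_k_accuracy(toy_cases, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

Reporting a strict top-1 figure alongside a looser top-k figure separates a model that names the right disease outright from one that only places it somewhere in its differential.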