ChatGPT Health performance in a structured test of triage recommendations - PubMed
- #Medical Safety
- #ChatGPT Health
- #AI Triage
- ChatGPT Health, launched in January 2026, was tested for triage recommendations using 60 clinician-authored vignettes across 21 clinical domains under 16 factorial conditions (960 total responses).
- Accuracy followed an inverted U-shaped pattern, with the highest failure rates at the clinical extremes: 35% for non-urgent presentations and 48% for emergency conditions.
- The system under-triaged 52% of gold-standard emergencies, directing patients with diabetic ketoacidosis and impending respiratory failure to evaluation within 24-48 hours instead of to an emergency department.
- Classical emergencies like stroke and anaphylaxis were correctly triaged.
- Anchoring bias (when family or friends minimized symptoms) significantly shifted triage recommendations in edge cases (OR 11.7, 95% CI 3.7-36.6), mostly toward less urgent care.
- Crisis-intervention messages activated unpredictably in suicidal-ideation cases, triggering more often when no specific method was described.
- Patient race, gender, and barriers to care showed no significant effects, though confidence intervals did not exclude clinically meaningful differences.
- Findings highlight missed high-risk emergencies and inconsistent crisis safeguard activation, raising safety concerns for AI triage systems before large-scale deployment.
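For readers unfamiliar with the statistics reported above, a result like "OR 11.7, 95% CI 3.7-36.6" is typically derived from a 2x2 table of outcomes using the Woolf (logit) method. The sketch below illustrates that computation; the counts are hypothetical and do not come from the study.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf 95% CI from a 2x2 table.

    a/b: less-urgent vs. unchanged recommendations with anchoring cue;
    c/d: the same counts for the control condition.
    (Labels are illustrative, not the study's actual coding.)
    """
    or_ = (a * d) / (b * c)
    # Standard error of log(OR), Woolf's method
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts, not from the study:
or_, lo, hi = odds_ratio_ci(20, 10, 5, 25)
print(f"OR {or_:.1f}, 95% CI {lo:.1f}-{hi:.1f}")  # OR 10.0, 95% CI 2.9-34.0
```

Note how small cell counts (here, only 5 in one cell) produce the wide confidence intervals seen in the study, which is why the authors could not rule out clinically meaningful effects for race, gender, and barriers to care.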