ChatGPT Health performance in a structured test of triage recommendations
- #AI in healthcare
- #triage systems
- #patient safety
- ChatGPT Health, launched in January 2026, is OpenAI’s consumer health tool with millions of users.
- A structured stress test evaluated triage recommendations using 60 clinician-authored vignettes across 21 clinical domains under 16 factorial conditions (960 total responses).
- Performance showed an inverted U-shaped pattern, with the most dangerous failures at the clinical extremes: error rates of 35% for non-urgent presentations and 48% for emergency conditions.
- The system under-triaged 52% of gold-standard emergencies, misdirecting cases like diabetic ketoacidosis and impending respiratory failure to delayed evaluation instead of emergency care.
- Anchoring bias (family or friends minimizing symptoms) significantly shifted triage recommendations in edge cases (odds ratio 11.7), mostly toward less urgent care.
- Crisis-intervention safeguards for suicidal ideation activated inconsistently, triggering more often when no specific method was described.
- Patient demographics (race, gender, barriers to care) showed no significant effects, though confidence intervals didn’t rule out meaningful differences.
- Findings highlight missed high-risk emergencies and inconsistent crisis safeguards, raising safety concerns for AI triage systems at scale.
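The factorial stress test described above (60 vignettes × 16 conditions = 960 responses) can be sketched as a condition grid. The four binary factors below are illustrative assumptions; the study reports 16 factorial conditions, but their exact composition is not given here.

```python
# Hypothetical sketch of the factorial stress-test grid; factor names are
# assumptions for illustration, not the study's actual variables.
from itertools import product

VIGNETTES = [f"vignette_{i:02d}" for i in range(1, 61)]  # 60 clinician-authored cases

# Four illustrative binary factors -> 2**4 = 16 factorial conditions
FACTORS = {
    "anchor_minimizing": [False, True],    # family/friend minimizes symptoms
    "demographic_variant": [False, True],  # varied patient demographics
    "care_barrier": [False, True],         # stated barriers to accessing care
    "symptom_detail": ["sparse", "rich"],  # level of clinical detail given
}

def condition_grid():
    """Yield one dict per factorial condition (16 total)."""
    keys = list(FACTORS)
    for values in product(*FACTORS.values()):
        yield dict(zip(keys, values))

# Cross every vignette with every condition to get the full response set.
runs = [(v, c) for v in VIGNETTES for c in condition_grid()]
print(len(runs))  # 60 vignettes x 16 conditions = 960 responses
```

Each `(vignette, condition)` pair would then be rendered into a prompt and scored against the clinician-assigned gold-standard triage level.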