AI disease-prediction models were trained on dubious data
- #Medical Research Ethics
- #Data Integrity
- #AI in Healthcare
- Researchers report in a preprint that dubious data sets are being used to train AI models that predict stroke and diabetes risk; some of these models may already be in clinical use, though it is unclear whether any flawed diagnoses have resulted.
- Adrian Barnett and colleagues identified 124 peer-reviewed papers that used two open-access health data sets of unclear origin; their analysis revealed oddities suggesting data fabrication, raising concerns about the data sets' reliability.
- At least two models built on these data sets have been used in hospitals in Indonesia and Spain, one appears in a 2024 medical-device patent, and two are publicly available web tools for risk assessment.
- Experts such as Soumyadeep Bhaumik warn that prediction models built on data of unknown provenance are inherently unreliable and should not inform clinical decision-making, as they risk prompting inappropriate treatments.
- Institutions and funders are urged to require disclosure of data sources for AI models in medical applications, journals should reject non-compliant papers, and flagged data sets should be taken down.
- Both data sets are hosted on Kaggle: the 'Stroke Prediction Dataset' (5,110 entries, used in 104 studies) shows irregularities such as suspiciously few missing data points, while the 'Diabetes prediction data set' (100,000 entries) contains implausible patterns such as a very limited set of blood-glucose values.
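The anomalies described above (too little missing data, too few distinct measurement values) can be screened for mechanically. The sketch below is illustrative only, not the authors' actual method; the column name, thresholds, and synthetic values are hypothetical.

```python
# Illustrative sanity checks for a numeric column in a health data set.
# Real clinical measurements typically show some missing entries and a
# wide spread of distinct values; the reverse can be a red flag for
# fabricated or synthetic data. Thresholds here are hypothetical.

def missing_rate(values):
    """Fraction of entries that are None (missing)."""
    return sum(v is None for v in values) / len(values)

def distinct_count(values):
    """Number of distinct non-missing values."""
    return len({v for v in values if v is not None})

def flag_suspicious(column, name, min_missing=0.01, min_distinct=20):
    """Return human-readable flags for a numeric column."""
    flags = []
    if missing_rate(column) < min_missing:
        flags.append(f"{name}: suspiciously few missing values")
    if distinct_count(column) < min_distinct:
        flags.append(f"{name}: suspiciously few distinct values")
    return flags

# Synthetic example: a glucose column with no missing data and only
# three distinct values across 1,000 rows -- both checks fire.
glucose = [80.0, 85.0, 90.0, 80.0, 85.0] * 200
print(flag_suspicious(glucose, "blood_glucose"))
```

Checks like these catch only crude fabrication; they cannot substitute for documented data provenance.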
- The data sets' creators, Federico Soriano Palacios and Mohammed Mustafa, cite confidentiality and decline to disclose their sources; Kaggle declined to comment on potential investigations or actions.
- The study highlights risks of using questionable data for AI in healthcare and calls for stricter oversight to prevent further unreliable research and clinical applications.