'Comically bad' datasets used to train clinical models for stroke and diabetes

a day ago

Researchers flagged a Kaggle dataset called 'droopy', used in a Scientific Reports paper on stroke prediction, that contains duplicated celebrity images and inappropriate content such as photos of children and images of Bell's palsy.
A preprint study by Adrian Barnett and Alexander Gibson identified widespread issues with Kaggle datasets on stroke and diabetes, leading to retractions and investigations of papers using flawed data for clinical models.
Many papers using these datasets made clinical recommendations without ethics statements. Some of the resulting models were deployed in hospitals or linked to patents, even though the underlying data lacked provenance and reliability checks.
Publishers including Springer Nature and Elsevier are investigating multiple papers, and some retractions have already been issued because the accuracy or origin of the data could not be verified.
Kaggle's community-driven approach lacks robust verification. Synthetic data is legitimate for benchmarking, but its misuse in medical research points to systemic problems in data quality and academic incentives.
