A major AI training dataset contains millions of examples of personal data

10 months ago

DataComp CommonPool, a major AI training dataset, contains millions of examples of personal data, including passports, credit cards, and birth certificates.
Based on an audit of 0.1% of the dataset, researchers estimate that it contains hundreds of millions of images with personally identifiable information (PII).
The dataset includes sensitive information like disability status, background checks, and home addresses from résumés linked to real people.
DataComp CommonPool, with 12.8 billion samples, is used to train generative AI models and has been downloaded over 2 million times.
Privacy measures such as face blurring were found to be ineffective; the algorithm missed millions of faces.
Web scraping practices raise ethical concerns, as individuals cannot consent to their data being used for AI training.
Children's personal information, originally shared for limited purposes, was also found in the dataset.
Current privacy laws like GDPR and CCPA may not fully protect against misuse of publicly available data in AI training sets.
The research calls for a reevaluation of indiscriminate web scraping and highlights the limitations of existing privacy protections.
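
The dataset-wide estimate above comes from scaling up a small audit. A minimal sketch of that arithmetic follows, assuming a simple proportional extrapolation (the researchers' actual method may differ). The function name and the hit count are hypothetical and for illustration only; only the 12.8 billion sample count and the 0.1% audit fraction come from the summary.

```python
def extrapolate_count(hits_in_sample: int, sample_size: int, population_size: int) -> float:
    """Scale a count observed in a random sample up to the whole population."""
    if sample_size <= 0 or not 0 <= hits_in_sample <= sample_size:
        raise ValueError("need 0 <= hits_in_sample <= sample_size and sample_size > 0")
    return population_size * hits_in_sample / sample_size


# Figures from the summary: 12.8 billion samples, 0.1% of them audited.
POPULATION = 12_800_000_000
sample_size = POPULATION // 1000  # 0.1% of the dataset

# Hypothetical number of audited samples flagged as containing PII.
# This value is made up for illustration; it is NOT a reported result.
hypothetical_hits = 100_000

estimate = extrapolate_count(hypothetical_hits, sample_size, POPULATION)
print(f"audited samples: {sample_size:,}")
print(f"extrapolated PII samples: {estimate:,.0f}")
```

Because the audit covers one sample in a thousand, each flagged item stands in for roughly a thousand items in the full dataset, so a small audit can support an estimate in the hundreds of millions.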
