A major AI training data set contains millions of examples of personal data
9 months ago
- #Data Privacy
- #AI Ethics
- #Machine Learning
- DataComp CommonPool, a major AI training dataset, contains millions of examples of personal data, including images of passports, credit cards, and birth certificates.
- After auditing just 0.1% of the dataset, researchers estimate that it contains hundreds of millions of images with personally identifiable information (PII).
- The dataset includes résumés linked to real people that disclose sensitive information such as disability status, background check results, and home addresses.
- DataComp CommonPool, with 12.8 billion samples, is used to train generative AI models and has been downloaded over 2 million times.
- Privacy measures such as automated face blurring were found to be ineffective: the algorithm missed millions of faces.
- Web scraping practices raise ethical concerns, as individuals cannot consent to their data being used for AI training.
- The dataset also contains children's personal information that was originally shared for limited purposes.
- Current privacy laws like GDPR and CCPA may not fully protect against misuse of publicly available data in AI training sets.
- The research calls for a reevaluation of indiscriminate web scraping and highlights the limitations of existing privacy protections.
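
The scale of the estimate follows from simple arithmetic on the two figures given above: a 0.1% audit of 12.8 billion samples. The sketch below is illustrative only and is not the researchers' method. It assumes the audit was a uniform random sample counted in samples, and the function names and the 250,000 hit count are hypothetical.

```python
# Figures taken from the summary above.
DATASET_SIZE = 12_800_000_000   # samples in DataComp CommonPool
AUDIT_ONE_IN = 1_000            # a 0.1% audit covers 1 sample in every 1,000


def audited_samples(dataset_size: int = DATASET_SIZE,
                    one_in: int = AUDIT_ONE_IN) -> int:
    """Number of samples covered by the audit."""
    return dataset_size // one_in


def extrapolate(hits_in_audit: int, one_in: int = AUDIT_ONE_IN) -> int:
    """Scale a count found in the audit up to the full dataset.

    Assumes the audited subset is representative of the whole.
    """
    return hits_in_audit * one_in


def implied_audit_hits(estimated_total: int,
                       one_in: int = AUDIT_ONE_IN) -> int:
    """Audit hits that a given full-dataset estimate would imply."""
    return estimated_total // one_in


if __name__ == "__main__":
    print(audited_samples())                # size of the audited subset
    print(extrapolate(250_000))             # hypothetical hit count, scaled up
    print(implied_audit_hits(100_000_000))  # hits implied by a 100M estimate
```

Under these assumptions the audit covers about 12.8 million samples, and each PII image found in it stands for roughly 1,000 in the full dataset. An estimate in the hundreds of millions would therefore correspond to hundreds of thousands of hits in the audited subset.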