Hasty Briefs (beta)

A major AI training data set contains millions of examples of personal data

9 months ago
  • #Data Privacy
  • #AI Ethics
  • #Machine Learning
  • DataComp CommonPool, a major AI training dataset, contains millions of examples of personal data, including passports, credit cards, and birth certificates.
  • Extrapolating from an audit of just 0.1% of the dataset, researchers estimate that it contains hundreds of millions of images with personally identifiable information (PII).
  • The dataset includes résumés linked to real people, which expose sensitive information such as disability status, background check results, and home addresses.
  • DataComp CommonPool, with 12.8 billion samples, is used to train generative AI models and has been downloaded over 2 million times.
  • Privacy measures such as automated face blurring proved ineffective: the algorithm missed millions of faces.
  • Web scraping practices raise ethical concerns, as individuals cannot consent to their data being used for AI training.
  • The dataset also contains children's personal information that was originally shared for limited purposes.
  • Current privacy laws like GDPR and CCPA may not fully protect against misuse of publicly available data in AI training sets.
  • The research calls for a reevaluation of indiscriminate web scraping and highlights the limitations of existing privacy protections.