Hasty Briefs (beta)

US publishers tell Common Crawl to stop scraping and delete archive

4 hours ago
  • #Copyright Infringement
  • #AI Training Data
  • #Web Scraping
  • Digital news publishers in the US are legally challenging Common Crawl's scraping of copyrighted, paywalled, and subscriber-only content.
  • Common Crawl, which creates free web archives used to train AI models, has been accused of failing to honor opt-out requests and failing to remove content when asked.
  • The New York Times's copyright lawsuit against OpenAI highlighted Common Crawl's role: 60% of GPT-3's training data came from its datasets.
  • Common Crawl denies lying to publishers or scraping behind paywalls, but admits removal processes are complex and not instantaneous.
  • Publishers argue Common Crawl has 'flagrantly infringed' copyright by distributing datasets to AI companies for commercial purposes without permission or compensation.
  • Common Crawl is funded by foundations and donations from AI companies, and is seen as crucial to the development of generative AI models.