US publishers tell Common Crawl to stop scraping and delete archive
5 hours ago
- #Copyright Infringement
- #AI Training Data
- #Web Scraping
- Digital news publishers in the US are legally challenging Common Crawl's scraping of copyrighted, paywalled, and subscriber-only content.
- Common Crawl, which creates free web archives used to train AI models, has been accused of failing to properly comply with opt-out requests or to remove content.
- The New York Times' copyright lawsuit against OpenAI highlighted Common Crawl's role: 60% of GPT-3's training data came from its datasets.
- Common Crawl denies lying to publishers or scraping behind paywalls, but admits removal processes are complex and not instantaneous.
- Publishers argue Common Crawl has 'flagrantly infringed' copyright by distributing datasets to AI companies for commercial purposes without permission or compensation.
- Common Crawl is funded by foundations and donations from AI companies, and is seen as crucial to the development of generative AI models.