US publishers tell Common Crawl to stop scraping and delete archive

4 hours ago

Digital news publishers in the US are legally challenging Common Crawl's scraping of copyrighted, paywalled, and subscriber-only content.
Common Crawl, which creates free web archives used to train AI models, has been accused of failing to properly honor opt-out requests or remove content.
The New York Times' copyright lawsuit against OpenAI highlighted Common Crawl's role: 60% of GPT-3's training data came from its datasets.
Common Crawl denies lying to publishers or scraping behind paywalls, but admits removal processes are complex and not instantaneous.
Publishers argue Common Crawl has 'flagrantly infringed' copyright by distributing datasets to AI companies for commercial purposes without permission or compensation.
Common Crawl is funded by foundations and donations from AI companies, and is seen as crucial to the development of generative AI models.

Hasty Briefs (beta)