The Nonprofit Feeding the Internet to AI Companies
- #AI Ethics
- #Data Scraping
- #Copyright
- Common Crawl, a nonprofit, has been scraping paywalled articles from major news websites and providing them to AI companies for training large language models (LLMs).
- Despite claiming to only scrape 'freely available content' and comply with removal requests, Common Crawl's archives still contain millions of paywalled articles from publishers like The New York Times, The Economist, and The Atlantic.
- Common Crawl's executive director, Rich Skrenta, argues that AI models should have free access to internet content, contending that robots, like people, should be able to 'read the books' for free.
- Publishers have requested content removal, but Common Crawl's archives remain largely unchanged; the organization's claimed removal rates (first 50%, then 70%, then 80%) appear misleading or false.
- Common Crawl's search tool incorrectly returns 'no captures' for domains such as NYTimes.com, obscuring the paywalled content actually present in its archives.
- AI companies including OpenAI, Google, and Nvidia rely on Common Crawl's data, which has been used to train models such as GPT-3, the foundation of ChatGPT.
- Common Crawl has received donations from AI companies (e.g., $250,000 each from OpenAI and Anthropic) and actively assists in curating AI-training datasets.
- Skrenta dismisses concerns about publishers' rights, stating that content on the internet should be freely accessible and downplaying the value of original journalism.
- Critics argue that AI companies and Common Crawl exploit publishers' work, undermining their business models while claiming to uphold 'openness.'
- Common Crawl's actions highlight tensions between AI development, copyright, and the ethics of data scraping.