News publishers limit Internet Archive access due to AI scraping concerns
- #Internet Archive
- #News Publishers
- #AI Scraping
- AI companies are scraping content through the Internet Archive's Wayback Machine, prompting news publishers such as The Guardian to limit access.
- The Guardian has excluded its articles from the Internet Archive's APIs and Wayback Machine URLs to prevent AI scraping.
- Other publishers, including The Financial Times and The New York Times, have also blocked the Internet Archive's crawlers.
- Reddit has restricted the Internet Archive's access to its data due to concerns about AI companies violating platform policies.
- The Internet Archive is implementing measures like rate-limiting and filtering to restrict bulk access to its libraries.
- Evidence shows the Wayback Machine has been used in training datasets for AI models like Google's T5 and Meta's Llama.
- Many news publishers are disallowing Internet Archive bots in their robots.txt files, with Gannett-owned sites leading the trend.
- The Internet Archive remains a critical resource for preserving digital content, despite concerns about misuse by AI companies.
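The robots.txt blocking described above works by naming a crawler's user agent and disallowing it site-wide. A minimal sketch of such a rule, assuming the Internet Archive's commonly cited `ia_archiver` and `archive.org_bot` user agents (the exact agent strings a publisher targets may vary):

```text
# Hypothetical robots.txt excerpt blocking Internet Archive crawlers
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
```

Note that robots.txt is advisory: it only deters crawlers that choose to honor it, which is part of why some publishers are also pursuing the access restrictions described above.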