News publishers limit Internet Archive access due to AI scraping concerns
- #Internet Archive
- #News Publishers
- #AI Scraping
- AI companies are scraping content through the Internet Archive's Wayback Machine, prompting news publishers such as The Guardian to limit access.
- The Guardian has excluded its articles from the Internet Archive's APIs and Wayback Machine URLs to prevent AI scraping.
- Other publishers, including The Financial Times and The New York Times, have also blocked the Internet Archive's crawlers.
- Reddit has restricted the Internet Archive's access to its data due to concerns about AI companies violating platform policies.
- The Internet Archive is implementing measures like rate-limiting and filtering to restrict bulk access to its libraries.
- Evidence shows the Wayback Machine has been used in training datasets for AI models like Google's T5 and Meta's Llama.
- Many news publishers are disallowing Internet Archive bots in their robots.txt files, with Gannett-owned sites leading the trend.
- The Internet Archive remains a critical resource for preserving digital content, despite concerns about misuse by AI companies.
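The robots.txt blocking described above works by naming a crawler's user agent and disallowing it site-wide. A minimal sketch of such a rule, assuming the Internet Archive's commonly cited `ia_archiver` and `archive.org_bot` user agents (the exact agent strings a publisher targets may vary):

```text
# Hypothetical robots.txt excerpt blocking Internet Archive crawlers
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
```

Note that robots.txt is advisory: it only deters crawlers that choose to honor it, which is part of why some publishers are also pursuing the access restrictions described above.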