Hasty Briefs

News publishers limit Internet Archive access due to AI scraping concerns

5 hours ago
  • #Internet Archive
  • #News Publishers
  • #AI Scraping
  • The Internet Archive's Wayback Machine is being used by AI companies to scrape content, leading news publishers like The Guardian to limit access.
  • The Guardian has excluded its articles from the Internet Archive's APIs and Wayback Machine URLs to prevent AI scraping.
  • Other publishers, including The Financial Times and The New York Times, have also blocked the Internet Archive's crawlers.
  • Reddit has restricted the Internet Archive's access to its data due to concerns about AI companies violating platform policies.
  • The Internet Archive is implementing measures like rate-limiting and filtering to restrict bulk access to its libraries.
  • Evidence shows that content from the Wayback Machine has appeared in datasets used to train AI models such as Google's T5 and Meta's Llama.
  • Many news publishers are disallowing Internet Archive bots in their robots.txt files, with Gannett-owned sites leading the trend.
  • The Internet Archive remains a critical resource for preserving digital content, despite concerns about misuse by AI companies.
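
As a rough illustration of the robots.txt point above, the sketch below checks which Internet Archive crawlers a given robots.txt turns away, using Python's standard `urllib.robotparser`. The user-agent tokens `archive.org_bot` and `ia_archiver` are assumptions about how publishers commonly name the Internet Archive's bots, and the sample rules, the `blocked_agents` helper, and the `news.example` URL are hypothetical. None of this is taken from any publisher's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks two assumed Internet Archive
# user-agent tokens while leaving every other crawler allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""

# Assumed tokens; a real audit should confirm the current crawler names.
IA_AGENTS = ("archive.org_bot", "ia_archiver")


def blocked_agents(robots_txt, agents=IA_AGENTS,
                   url="https://news.example/story"):
    """Return the agents in `agents` that may not fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT))
```

Note that robots.txt is advisory: it records a publisher's stated preference, and it does not by itself stop a crawler that chooses to ignore it.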