Web-scraping AI bots cause disruption for scientific databases and journals
a year ago
- #AI
- #Publishing
- #Bots
- DiscoverLife, an online image repository, saw a surge in bot traffic that slowed the site to the point of being unusable.
- Bots are increasingly problematic for scholarly publishers and researchers; many suspect they are scraping content to train generative AI tools such as chatbots and image generators.
- The high volume of bot requests strains systems, causing financial and operational disruptions.
- Smaller organizations with limited resources are particularly vulnerable to these disruptions.
- Internet bots have existed for decades, and some, such as search-engine crawlers, are useful.
- The rise of generative AI has led to an increase in 'bad' bots that scrape content without permission.
- Publishers such as BMJ and HighWire Press report significant increases in 'bad bot' traffic that have caused service disruptions.
- COAR (the Confederation of Open Access Repositories) reported that over 90% of surveyed members had experienced AI bots scraping their content, with two-thirds suffering service disruptions as a result.
- The release of DeepSeek, a Chinese-built large language model (LLM), showed that powerful AI models can be built with comparatively few resources, spurring even more bots to scrape training data.