Web-scraping AI bots cause disruption for scientific databases and journals
a year ago
- #AI
- #Publishing
- #Bots
- DiscoverLife, an online image repository, saw a surge in bot traffic that slowed the site to the point of being unusable.
- Bots are increasingly problematic for scholarly publishers and researchers; many suspect they are scraping content to train generative AI tools such as chatbots and image generators.
- The high volume of bot requests strains systems, causing financial and operational disruptions.
- Smaller organizations with limited resources are particularly vulnerable to these disruptions.
- Internet bots have existed for decades, and some, such as search-engine crawlers, are useful.
- The rise of generative AI has led to an increase in 'bad' bots that scrape content without permission.
- Publishers such as BMJ and HighWire Press report significant increases in 'bad bot' traffic that have caused service disruptions.
- COAR (the Confederation of Open Access Repositories) reported that over 90% of surveyed members had experienced AI bots scraping their content, with two-thirds suffering service disruptions as a result.
- The release of DeepSeek, a Chinese-built large language model (LLM), showed that powerful AI models can be built with comparatively few resources, spurring even more bots to scrape training data.