Hasty Briefs

Crawling a billion web pages in just over 24 hours, in 2025

2 days ago
  • #big-data
  • #web-crawling
  • #performance-optimization
  • Crawled 1.005 billion web pages in 25.5 hours for $462.
  • Used a cluster of 12 i7i.4xlarge machines with optimized fetcher and parser processes.
  • Parsing was a significant bottleneck because average web page sizes have grown.
  • Switched from lxml to selectolax for faster HTML parsing.
  • Network bandwidth was not a bottleneck; CPU was, especially due to SSL handshakes.
  • Unbounded growth of the URL frontier caused memory pressure during the crawl.
  • Followed politeness protocols like respecting robots.txt and maintaining crawl delays.
  • Compared findings with prior crawls, noting improvements and new challenges.
  • Discussed the evolving web landscape and the impact of AI on crawling.
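The headline numbers imply some useful unit rates. A back-of-the-envelope sketch, assuming load was spread evenly across the 25.5-hour run and the 12-machine cluster (the derived figures are mine, not from the post):

```python
# Inputs taken from the summary bullets; derived rates assume uniform load.
pages = 1_005_000_000      # total pages crawled
hours = 25.5               # wall-clock duration
machines = 12              # i7i.4xlarge instances
cost_usd = 462             # total spend

pages_per_sec = pages / (hours * 3600)        # cluster-wide rate
per_machine = pages_per_sec / machines        # assuming even distribution
usd_per_million = cost_usd / (pages / 1e6)    # unit cost

print(f"{pages_per_sec:,.0f} pages/s cluster-wide")   # ~10,948 pages/s
print(f"{per_machine:,.0f} pages/s per machine")      # ~912 pages/s
print(f"${usd_per_million:.2f} per million pages")    # ~$0.46
```

At roughly 900 pages per second per machine, it is easy to see why per-page CPU costs like TLS handshakes and HTML parsing dominate long before network bandwidth does.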
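The parse stage of a crawler mostly boils down to extracting outbound links (and sometimes text) from each fetched page. The post moved from lxml to selectolax for speed; as a dependency-free illustration of what that stage does, here is a minimal sketch using Python's stdlib `html.parser` (this is not the post's code, and selectolax's API differs, e.g. `HTMLParser(html).css("a")`):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect outbound hrefs from <a> tags while streaming through a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical page content for illustration.
html = '<html><body><a href="https://example.com/a">A</a><a href="/b">B</a></body></html>'
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # ['https://example.com/a', '/b']
```

A C-backed parser like selectolax does the same walk in native code, which matters when pages average hundreds of kilobytes and arrive thousands of times per second.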
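The politeness rules mentioned above (robots.txt and crawl delays) can be checked with the standard library's `urllib.robotparser`; a minimal sketch, with a hypothetical robots.txt and crawler name:

```python
import urllib.robotparser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
print(rp.crawl_delay("MyCrawler"))                                  # 2
```

A real crawler would fetch each host's robots.txt once, cache the parsed rules, and enforce the per-host delay between successive requests to that host.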