Crawling a billion web pages in just over 24 hours, in 2025
- #big-data
- #web-crawling
- #performance-optimization
- Crawled 1.005 billion web pages in 25.5 hours for $462.
- Used a cluster of 12 i7i.4xlarge machines with optimized fetcher and parser processes.
- Parsing was a significant bottleneck due to larger average web page sizes.
- Switched from lxml to selectolax for faster HTML parsing.
- Network bandwidth was not a bottleneck; CPU was, especially due to SSL handshakes.
- Memory growth from the crawl frontier (the queue of discovered but not-yet-fetched URLs) caused issues during the crawl.
- Followed politeness protocols like respecting robots.txt and maintaining crawl delays.
- Compared findings with prior crawls, noting improvements and new challenges.
- Discussed the evolving web landscape and the impact of AI on crawling.
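The headline figures above imply the following back-of-the-envelope throughput and cost, assuming (hypothetically) an even split of work across the 12 machines:

```python
pages = 1.005e9   # pages crawled
hours = 25.5      # wall-clock duration
machines = 12     # i7i.4xlarge cluster size
cost = 462        # total spend in USD

per_sec = pages / (hours * 3600)         # ~10,948 pages/s across the cluster
per_machine = per_sec / machines         # ~912 pages/s per machine
cost_per_million = cost / (pages / 1e6)  # ~$0.46 per million pages

print(round(per_sec), round(per_machine), round(cost_per_million, 2))
```

At roughly $0.46 per million pages, the dominant cost driver is machine-hours, which is why CPU efficiency (parsing, SSL) mattered more than bandwidth.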
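The parser switch named above was from lxml to selectolax. As a neutral illustration of the per-page work a parser process does (extracting outgoing links to feed the frontier), here is a minimal sketch using only Python's stdlib `html.parser`; selectolax performs the same job through a CSS-selector API backed by a C engine, which is the source of its speed advantage:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags as the document streams through."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<html><body><a href="https://example.com/a">a</a>'
               '<a href="/b">b</a></body></html>')
print(extractor.links)  # → ['https://example.com/a', '/b']
```

A pure-Python callback parser like this is CPU-heavy at billions of pages, which is exactly the bottleneck the post describes and why a C-backed parser pays off.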
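The politeness rules mentioned above (robots.txt plus crawl delays) can be sketched with the stdlib `urllib.robotparser`; the robots.txt content and the "MyBot" user-agent below are hypothetical placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler fetches this once per host.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 1
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("MyBot", "https://example.com/page"))       # True
print(rp.can_fetch("MyBot", "https://example.com/private/x"))  # False
print(rp.crawl_delay("MyBot"))                                 # 1
```

A polite crawler checks `can_fetch` before every request and sleeps at least `crawl_delay` seconds between hits to the same host, which caps per-host request rate regardless of total cluster throughput.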