Hasty Briefs

Crawling a billion web pages in just over 24 hours, in 2025

2 days ago
  • #big-data
  • #web-crawling
  • #performance-optimization
  • Crawled 1.005 billion web pages in 25.5 hours for $462.
  • Used a cluster of 12 i7i.4xlarge machines with optimized fetcher and parser processes.
  • Parsing was a significant bottleneck because average web page sizes have grown.
  • Switched from lxml to selectolax for faster HTML parsing.
  • Network bandwidth was not a bottleneck; CPU was, especially due to SSL handshakes.
  • Unbounded growth of the URL frontier caused memory pressure during the crawl.
  • Followed politeness protocols like respecting robots.txt and maintaining crawl delays.
  • Compared findings with prior crawls, noting improvements and new challenges.
  • Discussed the evolving web landscape and the impact of AI on crawling.
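The headline numbers imply some useful unit rates. A back-of-the-envelope sketch, assuming load was spread evenly across the 25.5-hour run and the 12-machine cluster (the derived figures are mine, not from the post):

```python
# Inputs taken from the summary bullets; derived rates assume uniform load.
pages = 1_005_000_000      # total pages crawled
hours = 25.5               # wall-clock duration
machines = 12              # i7i.4xlarge instances
cost_usd = 462             # total spend

pages_per_sec = pages / (hours * 3600)        # cluster-wide rate
per_machine = pages_per_sec / machines        # assuming even distribution
usd_per_million = cost_usd / (pages / 1e6)    # unit cost

print(f"{pages_per_sec:,.0f} pages/s cluster-wide")   # ~10,948 pages/s
print(f"{per_machine:,.0f} pages/s per machine")      # ~912 pages/s
print(f"${usd_per_million:.2f} per million pages")    # ~$0.46
```

At roughly 900 pages per second per machine, it is easy to see why per-page CPU costs like TLS handshakes and HTML parsing dominate long before network bandwidth does.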
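The parse stage of a crawler mostly boils down to extracting outbound links (and sometimes text) from each fetched page. The post moved from lxml to selectolax for speed; as a dependency-free illustration of what that stage does, here is a minimal sketch using Python's stdlib `html.parser` (this is not the post's code, and selectolax's API differs, e.g. `HTMLParser(html).css("a")`):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect outbound hrefs from <a> tags while streaming through a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical page content for illustration.
html = '<html><body><a href="https://example.com/a">A</a><a href="/b">B</a></body></html>'
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # ['https://example.com/a', '/b']
```

A C-backed parser like selectolax does the same walk in native code, which matters when pages average hundreds of kilobytes and arrive thousands of times per second.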
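The politeness rules mentioned above (robots.txt and crawl delays) can be checked with the standard library's `urllib.robotparser`; a minimal sketch, with a hypothetical robots.txt and crawler name:

```python
import urllib.robotparser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
print(rp.crawl_delay("MyCrawler"))                                  # 2
```

A real crawler would fetch each host's robots.txt once, cache the parsed rules, and enforce the per-host delay between successive requests to that host.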