Crawl Order and Disorder
- #web-crawling
- #optimization
- #search-engine
- The search engine's crawler takes a long time to finish: 99.9% of the crawling is done within 4 days, while the remaining 0.1% takes another week.
- Memory requirements dropped by 80% after migrating to the slop crawl data format, freeing capacity to run more crawl tasks concurrently.
- The crawler limits concurrent tasks per domain to avoid exceeding acceptable crawl rates and getting blocked by anti-crawler software (see the per-domain semaphore sketch after this list).
- Academic domains often combine restrictive crawl limits with large sizes and many subdomains, making them especially slow to finish.
- The original random crawl order meant that larger domains (often academic ones) could start late, delaying overall completion.
- A first attempt to sort tasks by subdomain count backfired, concentrating excessive simultaneous requests on blog-hosting domains.
- Adding jitter to the request delays and revising the sort to prioritize only domains with more than 8 subdomains improved the scheduling (see the sketch of both tweaks after this list).
- A future optimization could use historical crawl time or on-disk data size to better prioritize tasks; a hypothetical comparator along those lines is sketched at the end.
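A minimal sketch of the per-domain concurrency limit, assuming one semaphore per domain; the class name and the limit of 2 are hypothetical, not the crawler's actual settings:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical throttle: at most MAX_CONCURRENT_PER_DOMAIN requests
// run against any one domain at a time.
class DomainThrottle {
    private static final int MAX_CONCURRENT_PER_DOMAIN = 2; // assumed value

    private final Map<String, Semaphore> semaphores = new ConcurrentHashMap<>();

    // Block until a slot for this domain is free, run the task, then release.
    public void withPermit(String domain, Runnable task) throws InterruptedException {
        Semaphore s = semaphores.computeIfAbsent(domain,
                d -> new Semaphore(MAX_CONCURRENT_PER_DOMAIN));
        s.acquire();
        try {
            task.run();
        } finally {
            s.release();
        }
    }
}
```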
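The two scheduling tweaks, jitter on the inter-request delay and the subdomain-count sort, might look roughly like this; the 8-subdomain threshold comes from the notes above, while the jitter proportion and all names are assumptions:

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class CrawlScheduling {
    record CrawlTask(String domain, int subdomainCount) {}

    // Add up to +50% random jitter (assumed proportion) to the polite
    // delay, so tasks hitting the same host drift out of lockstep.
    static long delayWithJitterMillis(long baseDelayMillis) {
        long jitter = ThreadLocalRandom.current().nextLong(baseDelayMillis / 2 + 1);
        return baseDelayMillis + jitter;
    }

    // Move domains with more than 8 subdomains to the front so the
    // biggest jobs start early; the sort is stable, so everything else
    // keeps its original shuffled order.
    static void prioritize(List<CrawlTask> tasks) {
        tasks.sort(Comparator.comparing((CrawlTask t) -> t.subdomainCount() <= 8));
    }
}
```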
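And a hypothetical comparator for the future optimization, ordering the historically slowest (or, as a tiebreaker, largest-on-disk) domains first; the record and its fields are assumptions, not an existing API:

```java
import java.util.Comparator;

// Assumed per-domain statistics kept between crawls.
record DomainStats(String domain, long lastCrawlMillis, long onDiskBytes) {}

class LongestFirst {
    // Longest previous crawl first; break ties by on-disk size, also descending.
    static final Comparator<DomainStats> ORDER =
            Comparator.comparingLong(DomainStats::lastCrawlMillis).reversed()
                      .thenComparing(Comparator.comparingLong(DomainStats::onDiskBytes).reversed());
}
```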