Crawl Order and Disorder
- #web-crawling
- #optimization
- #search-engine
- The search engine's crawler takes a long time to finish: 99.9% of the crawling is done within 4 days, while the remaining 0.1% takes another week.
- Memory requirements dropped by 80% after migrating to the slop crawl data format, freeing capacity to run more crawl tasks concurrently.
- The crawler limits concurrent tasks per domain to avoid exceeding acceptable crawl rates and getting blocked by anti-crawler software (see the per-domain semaphore sketch after this list).
- Academic domains often combine restrictive crawl limits with large sizes and many subdomains, making them especially slow to finish.
- The original random crawl order meant that larger domains (often academic ones) could start late, delaying overall completion.
- A first attempt to sort tasks by subdomain count backfired, concentrating excessive simultaneous requests on blog-hosting domains.
- Adding jitter to the request delays and revising the sort to prioritize only domains with more than 8 subdomains improved the scheduling (see the sketch of both tweaks after this list).
- A future optimization could use historical crawl time or on-disk data size to better prioritize tasks; a hypothetical comparator along those lines is sketched at the end.
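A minimal sketch of the per-domain concurrency limit, assuming one semaphore per domain; the class name and the limit of 2 are hypothetical, not the crawler's actual settings:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical throttle: at most MAX_CONCURRENT_PER_DOMAIN requests
// run against any one domain at a time.
class DomainThrottle {
    private static final int MAX_CONCURRENT_PER_DOMAIN = 2; // assumed value

    private final Map<String, Semaphore> semaphores = new ConcurrentHashMap<>();

    // Block until a slot for this domain is free, run the task, then release.
    public void withPermit(String domain, Runnable task) throws InterruptedException {
        Semaphore s = semaphores.computeIfAbsent(domain,
                d -> new Semaphore(MAX_CONCURRENT_PER_DOMAIN));
        s.acquire();
        try {
            task.run();
        } finally {
            s.release();
        }
    }
}
```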
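The two scheduling tweaks, jitter on the inter-request delay and the subdomain-count sort, might look roughly like this; the 8-subdomain threshold comes from the notes above, while the jitter proportion and all names are assumptions:

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class CrawlScheduling {
    record CrawlTask(String domain, int subdomainCount) {}

    // Add up to +50% random jitter (assumed proportion) to the polite
    // delay, so tasks hitting the same host drift out of lockstep.
    static long delayWithJitterMillis(long baseDelayMillis) {
        long jitter = ThreadLocalRandom.current().nextLong(baseDelayMillis / 2 + 1);
        return baseDelayMillis + jitter;
    }

    // Move domains with more than 8 subdomains to the front so the
    // biggest jobs start early; the sort is stable, so everything else
    // keeps its original shuffled order.
    static void prioritize(List<CrawlTask> tasks) {
        tasks.sort(Comparator.comparing((CrawlTask t) -> t.subdomainCount() <= 8));
    }
}
```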
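And a hypothetical comparator for the future optimization, ordering the historically slowest (or, as a tiebreaker, largest-on-disk) domains first; the record and its fields are assumptions, not an existing API:

```java
import java.util.Comparator;

// Assumed per-domain statistics kept between crawls.
record DomainStats(String domain, long lastCrawlMillis, long onDiskBytes) {}

class LongestFirst {
    // Longest previous crawl first; break ties by on-disk size, also descending.
    static final Comparator<DomainStats> ORDER =
            Comparator.comparingLong(DomainStats::lastCrawlMillis).reversed()
                      .thenComparing(Comparator.comparingLong(DomainStats::onDiskBytes).reversed());
}
```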