Crawl Order and Disorder

  • #web-crawling
  • #optimization
  • #search-engine
  • The search engine's crawler takes a long time to finish: 99.9% of the crawl completes in about 4 days, while the remaining 0.1% takes another week.
  • Memory requirements dropped by 80% after migrating to the slop crawl data format, freeing capacity for more concurrent crawl tasks.
  • The crawler limits concurrent tasks per domain so it does not exceed polite crawl rates and get blocked by anti-crawler software (see the first sketch after this list).
  • Academic domains often have restrictive crawl limits; they also tend to be large, with many subdomains, so they take a long time to crawl.
  • The original random crawl order meant the largest domains (often academic) frequently started late, delaying overall completion.
  • An attempt to sort the queue by subdomain count backfired: blog hosts, where many subdomains share one server, suddenly received bursts of simultaneous requests.
  • Adding jitter to request delays and revising the sort to prioritize domains with more than 8 subdomains improved scheduling (see the second sketch below).
  • A future optimization could prioritize tasks by historical crawl time or on-disk data size, so the slowest domains start first (a sketch of this follows below).
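
A minimal sketch of the per-domain concurrency limit described above, assuming an asyncio-based crawler; `MAX_PER_DOMAIN`, `fetch`, and the URL list are illustrative, not the crawler's actual code.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

MAX_PER_DOMAIN = 2  # assumed cap on simultaneous requests to one domain

# One semaphore per domain, created lazily on first use.
domain_limits: defaultdict[str, asyncio.Semaphore] = defaultdict(
    lambda: asyncio.Semaphore(MAX_PER_DOMAIN)
)

async def fetch(url: str) -> None:
    domain = urlparse(url).netloc
    async with domain_limits[domain]:
        # A real crawler would issue the HTTP request here;
        # sleep stands in for network I/O.
        await asyncio.sleep(0.1)

async def main() -> None:
    urls = [f"https://www.example.edu/page/{i}" for i in range(10)]
    # All tasks start at once, but at most MAX_PER_DOMAIN run per domain.
    await asyncio.gather(*(fetch(u) for u in urls))

asyncio.run(main())
```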
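Next, a sketch of the jitter and the revised ordering, under the assumption that the queue is a list of `(domain, subdomain_count)` pairs; the 50% jitter factor and the shuffled tail are illustrative choices.

```python
import random

BASE_DELAY = 1.0  # assumed base delay (seconds) between requests to one domain

def jittered_delay() -> float:
    # Up to 50% random jitter keeps tasks that were scheduled together
    # (e.g. many blogs on one blog host) from firing in lockstep.
    return BASE_DELAY + random.uniform(0.0, BASE_DELAY * 0.5)

def order_queue(queue: list[tuple[str, int]]) -> list[tuple[str, int]]:
    # Domains with more than 8 subdomains start first, largest first,
    # since they take longest to finish; the rest keep a random order.
    big = sorted((d for d in queue if d[1] > 8), key=lambda d: d[1], reverse=True)
    rest = [d for d in queue if d[1] <= 8]
    random.shuffle(rest)
    return big + rest
```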
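Finally, a sketch of the suggested future heuristic: order pending tasks by how long the last crawl of each domain took (or by its on-disk data size as a proxy), longest first. The `history` mapping and its shape are assumptions.

```python
def prioritize(tasks: list[str], history: dict[str, float]) -> list[str]:
    # Longest-running domains start first; domains with no recorded
    # history default to 0 and end up at the back of the queue.
    return sorted(tasks, key=lambda d: history.get(d, 0.0), reverse=True)
```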