Large Feeds and RFC 5005
4 months ago
- #Performance Optimization
- #Feed Processing
- #SQLite
- Xobaque is being used to back the search at indieblog.page, importing nearly 5000 feeds.
- A single SQLite writer is a bottleneck, largely because each item is written with an inefficient pattern: SELECT the row; if found, UPDATE it; otherwise INSERT it.
- UPSERT cannot be used because full-text search is implemented via a virtual table that doesn't allow constraints or indexes.
- Some blogs have feeds with an excessive number of pages (e.g., 12000, 4000, 2000), which seems unnecessary for update feeds.
- RFC 5005 ("Feed Paging and Archiving") is mentioned as a solution, since it standardizes how clients traverse paged and archived feeds.
- The current architecture runs 10 goroutines that fetch feeds using If-Modified-Since and If-None-Match headers, skipping 304 (Not Modified) responses.
- A lock is acquired before writing to disk so that only one SQLite writer runs at a time, which is slow.
- Performance improved after splitting selects, inserts, and updates into chunks of 1000 items, reducing the total runtime to about 12 hours.
- Potential next steps include keeping prepared statements in global variables and filtering on Last-Modified headers to skip unchanged feed pages.
- A log analysis showed 1782 feeds processed with SQL and 2179 skipped due to HTTP caching, with most pages unchanged.
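Why UPSERT is off the table: SQLite's `INSERT ... ON CONFLICT` needs a unique index or constraint as its conflict target, and FTS5 virtual tables accept neither. A hypothetical schema sketch (the `posts` table and its columns are illustrative, not the project's actual schema):

```sql
-- FTS5 virtual tables allow no UNIQUE constraints or explicit
-- indexes, so there is no conflict target for an UPSERT.
CREATE VIRTUAL TABLE posts USING fts5(url, title, content);

-- This therefore fails against the virtual table:
-- INSERT INTO posts(url, title, content) VALUES (?, ?, ?)
--   ON CONFLICT(url) DO UPDATE SET title = excluded.title;
```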
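For reference, a paged feed under RFC 5005 (section 3) advertises its neighbors with standard link relations, so a client can stop crawling once it has the pages it needs. The URLs below are illustrative only:

```xml
<!-- Paged feed (RFC 5005 §3): each page links to its neighbors -->
<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="https://example.com/feed?page=2"/>
  <link rel="first" href="https://example.com/feed"/>
  <link rel="previous" href="https://example.com/feed?page=1"/>
  <link rel="next" href="https://example.com/feed?page=3"/>
  <!-- entries... -->
</feed>
```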