Sparse File LRU Cache
5 days ago
- #file-systems
- #performance-optimization
- #data-caching
- Sparse files are logical files whose empty blocks are not physically backed by disk until they are written to.
- Columnar data formats store each column's values contiguously, which makes them efficient for analytics queries that read only a few columns.
- Amplitude uses sparse files to cache only necessary columns from S3 to local SSDs, optimizing storage and performance.
- Two initial caching strategies were considered: caching entire files (wasteful) or individual columns (metadata-heavy).
- Sparse files offer a middle ground, caching only used columns as physical blocks, reducing metadata and disk usage.
- Metadata on cached columns is managed via RocksDB, tracking block presence and last read times for LRU invalidation.
- Variable-sized logical blocks optimize reading by accommodating file format headers and column layouts.
- The sparse file LRU cache reduces S3 GETs, file system metadata, block overhead, and IOPS.
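
The points above can be illustrated with a minimal Python sketch. It is not Amplitude's implementation. The names `SparseFileCache`, `fake_remote_fetch`, and `demo` are hypothetical, an in-memory `OrderedDict` stands in for the RocksDB metadata, and a local function stands in for an S3 ranged GET. Eviction here only drops metadata. Actually freeing the disk blocks would need hole punching (for example Linux `fallocate` with `FALLOC_FL_PUNCH_HOLE`), which is not shown.

```python
import os
import tempfile
from collections import OrderedDict


class SparseFileCache:
    """Toy sparse-file block cache with LRU eviction (illustrative only)."""

    def __init__(self, path, logical_size, max_blocks):
        self.path = path
        self.max_blocks = max_blocks
        self.fetches = 0
        # Keys are (offset, length), ordered from least to most recently read.
        # Assumption: this dict stands in for the metadata kept in RocksDB.
        self.blocks = OrderedDict()
        self.f = open(path, "w+b")
        # Extending an empty file with truncate() leaves a hole on file systems
        # that support sparse files, so no data blocks are allocated yet.
        self.f.truncate(logical_size)

    def read(self, offset, length, fetch):
        key = (offset, length)  # variable-sized logical block
        if key in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(key)
        else:
            # Cache miss: fetch the byte range and write it at its logical offset.
            data = fetch(offset, length)
            self.fetches += 1
            self.f.seek(offset)
            self.f.write(data)
            self.f.flush()
            self.blocks[key] = None
            if len(self.blocks) > self.max_blocks:
                # Evict the least recently used block (metadata only; a real
                # cache would also punch a hole to release the disk space).
                self.blocks.popitem(last=False)
        self.f.seek(offset)
        return self.f.read(length)

    def close(self):
        self.f.close()


def fake_remote_fetch(offset, length):
    # Hypothetical stand-in for a ranged GET against remote storage.
    return bytes((offset + i) % 256 for i in range(length))


def demo():
    with tempfile.TemporaryDirectory() as d:
        cache = SparseFileCache(os.path.join(d, "data.col"), 1 << 20, max_blocks=2)
        cache.read(0, 128, fake_remote_fetch)      # miss
        cache.read(4096, 256, fake_remote_fetch)   # miss
        cache.read(0, 128, fake_remote_fetch)      # hit, refreshes recency
        cache.read(65536, 64, fake_remote_fetch)   # miss, evicts (4096, 256)
        size = os.path.getsize(cache.path)
        cache.close()
        return cache.fetches, list(cache.blocks), size


if __name__ == "__main__":
    print(demo())
```

The logical file size stays at 1 MiB no matter how few blocks are cached, and only the byte ranges that were actually read trigger a fetch. Keying blocks by `(offset, length)` is one simple way to model variable-sized logical blocks. A real cache would also need to handle overlapping or partially cached ranges.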