Hasty Briefs (beta)

  • #file-systems
  • #performance-optimization
  • #data-caching
  • A sparse file is a logical file whose empty blocks are not physically backed by storage until they are written to.
  • Columnar data formats store each column's values contiguously, which makes them efficient for analytics queries that read only a few columns.
  • Amplitude uses sparse files to cache only necessary columns from S3 to local SSDs, optimizing storage and performance.
  • Two initial caching strategies were considered: caching entire files (wasteful) or individual columns (metadata-heavy).
  • Sparse files offer a middle ground, caching only used columns as physical blocks, reducing metadata and disk usage.
  • Metadata on cached columns is managed via RocksDB, tracking block presence and last read times for LRU invalidation.
  • Variable-sized logical blocks optimize reading by accommodating file format headers and column layouts.
  • The sparse file LRU cache reduces S3 GETs, file system metadata, block overhead, and IOPS.
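
The sparse-file idea summarized above can be illustrated with a minimal Python sketch. It is not Amplitude's implementation: the class name `SparseColumnCache`, the fixed 4096-byte block size, and the in-memory dict standing in for the RocksDB metadata store are all assumptions made for illustration, and a simple counter stands in for last read times. The sketch only shows the core mechanics: a file with a large logical size, blocks written at their original offsets, and per-block metadata that identifies the least recently read block.

```python
import itertools
import os
import tempfile

BLOCK_SIZE = 4096  # assumed fixed block size; the source describes variable-sized blocks


class SparseColumnCache:
    """Hypothetical sketch of a sparse-file block cache with LRU metadata."""

    def __init__(self, path, logical_size):
        self.path = path
        self.last_read = {}              # block index -> logical read time (stand-in for RocksDB)
        self._clock = itertools.count()  # logical clock instead of wall-clock time
        with open(path, "wb") as f:
            # Sets the logical size; on file systems that support sparse
            # files, no data blocks are allocated until something is written.
            f.truncate(logical_size)

    def write_block(self, index, data):
        """Store one fetched block at the offset it has in the remote file."""
        if len(data) > BLOCK_SIZE:
            raise ValueError("data larger than one block")
        with open(self.path, "r+b") as f:
            f.seek(index * BLOCK_SIZE)
            f.write(data)
        self.last_read[index] = next(self._clock)

    def read_block(self, index):
        """Return the block if cached, else None (caller would fetch it remotely)."""
        if index not in self.last_read:
            return None
        with open(self.path, "rb") as f:
            f.seek(index * BLOCK_SIZE)
            data = f.read(BLOCK_SIZE)
        self.last_read[index] = next(self._clock)
        return data

    def lru_candidate(self):
        """Index of the least recently read block, or None if nothing is cached."""
        if not self.last_read:
            return None
        return min(self.last_read, key=self.last_read.get)


def demo():
    directory = tempfile.mkdtemp()
    cache = SparseColumnCache(os.path.join(directory, "file.cache"), 1 << 20)
    cache.write_block(10, b"colA")   # only the blocks holding used columns are written
    cache.write_block(20, b"colB")
    hit = cache.read_block(10)       # refreshes block 10, so block 20 becomes the LRU block
    miss = cache.read_block(3)       # never written: reported as a cache miss
    return os.path.getsize(cache.path), hit[:4], miss, cache.lru_candidate()
```

Calling `demo()` creates a file whose logical size is 1 MiB while only two blocks have been written. How much physical space that actually uses depends on the file system; on Linux it can be inspected with `os.stat(path).st_blocks`. Releasing the space of an evicted block would require hole punching, which this sketch leaves out.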