Hasty Briefs (beta)

Space-efficient indexing for immutable log data

11 days ago
  • indexing
  • disk-backed hashmap
  • high-cardinality
  • Seq uses a disk-backed hashmap to index high-cardinality predicates efficiently
  • Records in Seq are initially ingested into unordered write-ahead logs based on their timestamps
  • Indexes in Seq reduce the I/O needed for queries and are computed after the write-ahead logs are coalesced
  • Traditional bitmap indexing works well for low-cardinality properties but not for high-cardinality ones
  • For high-cardinality indexes, Seq uses a disk-backed hashmap that stores hashes and their page hits
  • The on-disk layout of the disk-backed hashmap includes bucket offsets, keys, and values
  • An in-memory HashMap is built during indexing and formatted into its on-disk representation at the end
  • The article compares the disk-backed hashmap with odht and bincode in terms of size on disk and read performance
  • The disk-backed hashmap is memory-mapped directly from disk for indexing in Seq
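
To make the bullets on the hashmap layout more concrete, here is a minimal Python sketch of the general idea: build an in-memory map from value hashes to page hits, write it out as bucket offsets, keys, and values, then memory-map the file and look values up in place. This is not Seq's actual format. The field widths, the CRC32 hash, the header, and every function name (`build_index`, `serialize`, `lookup`, `demo`) are assumptions made for illustration only.

```python
import mmap
import os
import struct
import tempfile
import zlib
from collections import defaultdict

# Hypothetical fixed-width fields; a real index would likely size these differently.
HEADER = struct.Struct("<II")  # bucket count, entry count
OFFSET = struct.Struct("<I")   # index of the first entry in a bucket
KEY = struct.Struct("<I")      # 32-bit hash of a property value
VALUE = struct.Struct("<I")    # page number where that hash was seen


def hash_value(value: str) -> int:
    # CRC32 is a stand-in for whatever hash function a real index uses.
    return zlib.crc32(value.encode("utf-8"))


def build_index(pages):
    """In-memory representation: hash -> set of pages containing that hash."""
    hits = defaultdict(set)
    for page_no, values in enumerate(pages):
        for value in values:
            hits[hash_value(value)].add(page_no)
    return hits


def serialize(hits, bucket_count):
    """Format the in-memory map as [header][bucket offsets][keys][values]."""
    entries = sorted(
        (h % bucket_count, h, page) for h, pages in hits.items() for page in pages
    )
    offsets, i = [], 0
    for bucket in range(bucket_count):
        offsets.append(i)
        while i < len(entries) and entries[i][0] == bucket:
            i += 1
    offsets.append(len(entries))  # sentinel marking the end of the last bucket

    out = bytearray(HEADER.pack(bucket_count, len(entries)))
    for offset in offsets:
        out += OFFSET.pack(offset)
    for _, h, _ in entries:
        out += KEY.pack(h)
    for _, _, page in entries:
        out += VALUE.pack(page)
    return bytes(out)


def lookup(buf, value):
    """Return candidate pages for a value, reading straight from the buffer."""
    bucket_count, entry_count = HEADER.unpack_from(buf, 0)
    h = hash_value(value)
    bucket = h % bucket_count
    offsets_at = HEADER.size
    keys_at = offsets_at + (bucket_count + 1) * OFFSET.size
    values_at = keys_at + entry_count * KEY.size
    (start,) = OFFSET.unpack_from(buf, offsets_at + bucket * OFFSET.size)
    (end,) = OFFSET.unpack_from(buf, offsets_at + (bucket + 1) * OFFSET.size)
    pages = []
    for i in range(start, end):
        (key,) = KEY.unpack_from(buf, keys_at + i * KEY.size)
        if key == h:
            pages.append(VALUE.unpack_from(buf, values_at + i * VALUE.size)[0])
    return pages


def demo():
    # Three pages of records, each listing a high-cardinality property value.
    pages = [["req-1", "req-2"], ["req-3"], ["req-1", "req-4"]]
    data = serialize(build_index(pages), bucket_count=4)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.bin")
        with open(path, "wb") as f:
            f.write(data)
        # Map the file read-only and query it without deserializing it first.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                return lookup(mapped, "req-1"), lookup(mapped, "req-3")


if __name__ == "__main__":
    print(demo())
```

Because only hashes are stored, a lookup in this sketch returns candidate pages: two different values that share a hash would both report each other's pages, so a query would still need to check the records on those pages. A lookup touches one pair of bucket offsets and the entries of a single bucket, which is what makes reading through a memory map cheap compared with deserializing the whole structure.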