Space-efficient indexing for immutable log data
11 days ago
- indexing
- disk-backed hashmap
- high-cardinality
- Seq uses a disk-backed hashmap to index high-cardinality predicates efficiently
- Records in Seq are initially ingested into unordered write-ahead logs based on their timestamps
- Indexes in Seq reduce the IO needed for queries and are computed after the logs are coalesced
- Traditional bitmap indexing works well for low-cardinality properties but not for high-cardinality ones
- A disk-backed hashmap is used for high-cardinality indexes in Seq; it stores hashes and page hits
- The on-disk layout of the disk-backed hashmap includes bucket offsets, keys, and values
- An in-memory HashMap is built during indexing and formatted into its on-disk representation at the end
- The disk-backed hashmap is compared with odht and bincode in terms of size on disk and read performance
- The disk-backed hashmap is memory-mapped directly from disk when Seq uses the index
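
To make the layout points above concrete, here is a minimal sketch in Rust of a flat hashmap made of bucket offsets, keys, and values. It is an illustration under stated assumptions, not Seq's actual format: the function names (`format_index`, `lookup`), the use of little-endian `u32` words, the leading bucket-count word, and the `hash % bucket_count` bucket choice are all hypothetical. Keys are treated as already-computed hashes and values as page numbers.

```rust
use std::collections::HashMap;

/// Append a u32 to the buffer in little-endian form.
fn put_u32(buf: &mut Vec<u8>, v: u32) {
    buf.extend_from_slice(&v.to_le_bytes());
}

/// Read the `i`-th little-endian u32 word from the buffer.
fn get_u32(buf: &[u8], i: usize) -> u32 {
    let mut word = [0u8; 4];
    word.copy_from_slice(&buf[i * 4..i * 4 + 4]);
    u32::from_le_bytes(word)
}

/// Format an in-memory `hash -> pages` map into one flat buffer
/// (hypothetical layout, in u32 words):
/// [bucket_count][bucket offsets: bucket_count + 1][keys][values]
fn format_index(map: &HashMap<u32, Vec<u32>>, bucket_count: u32) -> Vec<u8> {
    assert!(bucket_count > 0, "need at least one bucket");
    // Group (hash, page) pairs by bucket; a hash with several
    // page hits becomes several entries in the same bucket.
    let mut buckets: Vec<Vec<(u32, u32)>> = vec![Vec::new(); bucket_count as usize];
    for (&hash, pages) in map {
        for &page in pages {
            buckets[(hash % bucket_count) as usize].push((hash, page));
        }
    }
    let mut buf = Vec::new();
    put_u32(&mut buf, bucket_count);
    // Bucket offsets: index of each bucket's first entry, plus an end sentinel.
    let mut offset = 0u32;
    for bucket in &buckets {
        put_u32(&mut buf, offset);
        offset += bucket.len() as u32;
    }
    put_u32(&mut buf, offset);
    // Keys, then values, both in bucket order.
    for bucket in &buckets {
        for &(hash, _) in bucket {
            put_u32(&mut buf, hash);
        }
    }
    for bucket in &buckets {
        for &(_, page) in bucket {
            put_u32(&mut buf, page);
        }
    }
    buf
}

/// Look up the pages recorded for `hash`, reading only from the byte slice.
fn lookup(buf: &[u8], hash: u32) -> Vec<u32> {
    let n = get_u32(buf, 0) as usize;
    let bucket = hash as usize % n;
    let start = get_u32(buf, 1 + bucket) as usize;
    let end = get_u32(buf, 2 + bucket) as usize;
    let total = get_u32(buf, 1 + n) as usize; // end sentinel = entry count
    let keys_base = n + 2; // word index where keys begin
    let values_base = keys_base + total; // word index where values begin
    (start..end)
        .filter(|&i| get_u32(buf, keys_base + i) == hash)
        .map(|i| get_u32(buf, values_base + i))
        .collect()
}
```

Because `lookup` only needs a `&[u8]`, the same function could be pointed at a memory-mapped file instead of a `Vec<u8>`, so a query would touch only the bucket offsets and the one bucket it probes. Storing keys and values in separate runs keeps each bucket's keys contiguous for the scan; whether Seq makes the same choice is not stated in the summary above.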