Beyond indexes: How open table formats optimize query performance

4 days ago

The author's career in data started with SQL Server performance specialization, focusing on indexes, locking, and query design.
Open table formats like Apache Iceberg take a different approach to indexing and performance optimization than a traditional RDBMS.
Open table formats (Iceberg, Delta Lake, Hudi) do not provide RDBMS-style secondary indexes such as B-trees, because their workloads have different needs.
Analytical workloads prioritize data scanning and pruning over point lookups, making traditional secondary indexes inefficient.
Data organization in open table formats relies on partitioning, sorting, and auxiliary structures like Bloom filters for performance.
Columnar storage and metadata (min/max statistics) in formats like Iceberg enable efficient data skipping (pruning).
Materialized views in open table formats play a role similar to secondary indexes in an RDBMS, but are tailored to analytical queries.
Future evolution of open table formats may include richer metadata and standardized sidecar files for better performance optimization.
The term 'index' in open table formats loosely refers to structures like Bloom filters and column statistics rather than traditional B-trees.
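
The min/max pruning described in the points above can be illustrated with a minimal sketch. It assumes a simplified, hypothetical metadata model: the `DataFile` class, the `prune_files` function, and the file names are invented for illustration and are not the Iceberg manifest schema or any library's API. It shows why sorting matters: per-file ranges only let the planner skip files when they do not all overlap the filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata entry (not a real table-format schema)."""
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def prune_files(files, lower, upper):
    """Keep only files whose [min, max] range overlaps the filter [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


# Data sorted on the column: file ranges do not overlap, so most files are skipped.
sorted_layout = [
    DataFile("a.parquet", 1, 100),
    DataFile("b.parquet", 101, 200),
    DataFile("c.parquet", 201, 300),
]

# Same value range, unsorted: every file spans nearly everything, so nothing is skipped.
unsorted_layout = [
    DataFile("x.parquet", 1, 298),
    DataFile("y.parquet", 2, 300),
    DataFile("z.parquet", 3, 299),
]

# Filter: column BETWEEN 120 AND 150
sorted_hits = prune_files(sorted_layout, 120, 150)
unsorted_hits = prune_files(unsorted_layout, 120, 150)

print([f.path for f in sorted_hits])    # files that must be read in the sorted layout
print([f.path for f in unsorted_hits])  # files that must be read in the unsorted layout
```

With the sorted layout only one of the three files has to be scanned, while the unsorted layout forces a read of all three. The statistics are identical in kind; only the data organization differs.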
