Beyond indexes: How open table formats optimize query performance
4 days ago
- #database-performance
- #data-engineering
- #open-table-formats
- The author began their data career as a SQL Server performance specialist, focusing on indexes, locking, and query design.
- Open table formats like Apache Iceberg differ from traditional RDBMS in their approach to indexing and performance optimization.
- The B-tree-style secondary indexes common in an RDBMS are not present in open table formats (Iceberg, Delta Lake, Hudi), because the two target different workloads.
- Analytical workloads scan large volumes of data and depend on pruning rather than point lookups, which makes traditional secondary indexes a poor fit.
- Data organization in open table formats relies on partitioning, sorting, and auxiliary structures like Bloom filters for performance.
- Columnar storage and metadata (min/max statistics) in formats like Iceberg enable efficient data skipping (pruning).
- Materialized views in open table formats serve a purpose similar to secondary indexes in an RDBMS, but are tailored for analytical queries.
- Future evolution of open table formats may include richer metadata and standardized sidecar files for better performance optimization.
- The term 'index' in open table formats loosely refers to structures like Bloom filters and column statistics rather than traditional B-trees.
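
To make the pruning idea in the points above concrete, here is a minimal sketch of data skipping with per-file min/max statistics, and of why sorting matters for it. It is a toy model, not how Iceberg, Delta Lake, or Hudi are implemented: the `DataFile` class and the `write_files` and `scan_equals` helpers are hypothetical names invented for this illustration, and each "file" is just an in-memory list of integers.

```python
import random
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for one data file plus its column statistics."""
    rows: list
    min_value: int
    max_value: int


def write_files(values, rows_per_file):
    """Split values into fixed-size 'files' and record min/max for each one."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append(DataFile(chunk, min(chunk), max(chunk)))
    return files


def scan_equals(files, target):
    """Evaluate `value == target`; return (matching rows, files actually read)."""
    matches = []
    files_read = 0
    for f in files:
        # Metadata-only check: skip the file if target lies outside [min, max].
        if target < f.min_value or target > f.max_value:
            continue
        files_read += 1
        matches.extend(v for v in f.rows if v == target)
    return matches, files_read


random.seed(0)
values = list(range(1000))
shuffled = values[:]
random.shuffle(shuffled)

# Same data, two layouts: sorted on the filter column vs. arrival order.
sorted_files = write_files(sorted(values), rows_per_file=100)
unsorted_files = write_files(shuffled, rows_per_file=100)

sorted_matches, sorted_read = scan_equals(sorted_files, 742)
unsorted_matches, unsorted_read = scan_equals(unsorted_files, 742)

print(f"sorted layout:   read {sorted_read} of {len(sorted_files)} files")
print(f"unsorted layout: read {unsorted_read} of {len(unsorted_files)} files")
```

With the sorted layout, each file covers a narrow, non-overlapping value range, so the predicate touches a single file. With the shuffled layout, nearly every file's min/max range spans the whole domain, so the statistics prune very little. This is the sense in which sorting and partitioning, rather than a B-tree, do the work of an index here.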