Hasty Briefs (beta)

Beyond indexes: How open table formats optimize query performance

4 days ago
  • #database-performance
  • #data-engineering
  • #open-table-formats
  • The author began their data career as a SQL Server performance specialist, focusing on indexes, locking, and query design.
  • Open table formats like Apache Iceberg take a different approach to indexing and performance optimization than a traditional RDBMS.
  • Open table formats (Iceberg, Delta Lake, Hudi) do not offer RDBMS-style secondary indexes such as B-trees, because their workloads have different needs.
  • Analytical workloads prioritize scanning and pruning large amounts of data over point lookups, which makes traditional secondary indexes inefficient for them.
  • Data organization in open table formats relies on partitioning, sorting, and auxiliary structures like Bloom filters for performance.
  • Columnar storage and metadata (min/max statistics) in formats like Iceberg enable efficient data skipping (pruning).
  • Materialized views in open table formats serve a similar purpose to secondary indexes in RDBMS but are tailored for analytical queries.
  • Future evolution of open table formats may include richer metadata and standardized sidecar files for better performance optimization.
  • The term 'index' in open table formats loosely refers to structures like Bloom filters and column statistics rather than traditional B-trees.
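
The pruning ideas in the points above can be illustrated with a small, self-contained Python sketch. It is not how Iceberg, Delta Lake, or Hudi implement anything: the `DataFile` class, the file names, the in-memory layout, and the Bloom filter parameters are all hypothetical, chosen only to show why sorted data makes min/max statistics effective and how a Bloom filter can help when they are not.

```python
import hashlib


class DataFile:
    """Hypothetical stand-in for one data file and its metadata for a single column."""

    def __init__(self, name, values, num_bits=1024, num_hashes=3):
        self.name = name
        self.values = list(values)
        # Column statistics: the smallest and largest value stored in this file.
        self.min_value = min(self.values)
        self.max_value = max(self.values)
        # Toy Bloom filter: a bit set kept in one Python int (sizes are assumptions).
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bloom = 0
        for value in self.values:
            for pos in self._bit_positions(value):
                self.bloom |= 1 << pos

    def _bit_positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def overlaps(self, low, high):
        # True if [low, high] intersects this file's [min_value, max_value].
        return self.min_value <= high and self.max_value >= low

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bloom >> pos) & 1 for pos in self._bit_positions(value))


def prune_range(files, low, high):
    """Keep only files whose min/max statistics overlap the range predicate."""
    return [f.name for f in files if f.overlaps(low, high)]


def prune_point(files, value):
    """Apply min/max pruning first, then the Bloom filter, for an equality predicate."""
    return [f.name for f in files if f.overlaps(value, value) and f.might_contain(value)]


ids = range(1000)
# Sorted layout: each file holds a contiguous slice of 100 ids.
sorted_layout = [DataFile(f"file-{i}", [v for v in ids if v // 100 == i]) for i in range(10)]
# Scattered layout: every file holds ids from across the whole range.
scattered_layout = [DataFile(f"file-{i}", [v for v in ids if v % 10 == i]) for i in range(10)]

print(prune_range(sorted_layout, 250, 260))          # ['file-2']
print(len(prune_range(scattered_layout, 250, 260)))  # 10, since every file's range overlaps
print(prune_point(scattered_layout, 457))            # always includes 'file-7'
```

With the sorted layout, the range predicate touches a single file. With the scattered layout, min/max statistics cannot exclude anything, and only the Bloom filter narrows the point lookup. A Bloom filter never produces false negatives, so the file that really holds the value is always kept, but it may occasionally keep a file that does not.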