The strange case of the underestimated Merge Join node
4 months ago
- #Query Optimization
- #PostgreSQL
- #Performance Tuning
- A customer reported a query that was slow on first execution after a batch process but fast on subsequent runs, with different execution plans.
- Initial suspicion was autoanalyze not running post-batch, but tables weren't vacuumed or analyzed between executions.
- The query involved a LEFT JOIN between two tables with non-overlapping histograms for the join columns.
- First execution used a Merge Join with high cost due to planner's estimation based on incomplete histogram data.
- Second execution used a Nested Loop Join, benefiting from cleaned index tuples providing accurate extreme values.
- The planner's get_actual_variable_endpoint() function aborts early on first execution, leading to inaccurate estimates.
- A script was provided to reproduce the scenario, demonstrating the issue with dead heap tuples affecting index scans.
- The case highlights a rare scenario where query plans change without data or statistics updates, due to index tuple cleanup.