The strange case of the underestimated Merge Join node

4 months ago

A customer reported a query that was slow on first execution after a batch process but fast on subsequent runs, with different execution plans.
Initial suspicion was autoanalyze not running post-batch, but tables weren't vacuumed or analyzed between executions.
The query involved a LEFT JOIN between two tables with non-overlapping histograms for the join columns.
First execution used a Merge Join with high cost due to planner's estimation based on incomplete histogram data.
Second execution used a Nested Loop Join, benefiting from cleaned index tuples providing accurate extreme values.
The planner's get_actual_variable_endpoint() function aborts early on first execution, leading to inaccurate estimates.
A script was provided to reproduce the scenario, demonstrating the issue with dead heap tuples affecting index scans.
The case highlights a rare scenario where query plans change without data or statistics updates, due to index tuple cleanup.

Hasty Briefsbeta