When SIGTERM Does Nothing: A Postgres Mystery
10 months ago
- #Debugging
- #Postgres
- #OpenSource
- The worst bugs are the ones that get ignored at first, only to resurface later and cause far more frustration.
- ClickPipes encountered a critical bug with logical replication slot creation on Postgres read replicas, leading to unkillable queries.
- The issue appeared when creating a replication slot on a standby: the session waited indefinitely for a transaction on the primary to complete.
- Investigation traced the bug to an inefficient polling loop in Postgres's `XactLockTableWait` function, which on standbys kept polling without checking for interrupts.
- A patch was submitted and accepted by the Postgres community, adding interrupt checks to resolve the unkillable query issue.
- Further improvements, like better wait event reporting and efficient waiting mechanisms, are in progress for future Postgres releases.
- The experience highlights the importance of persistence in debugging and the value of open-source contributions.
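
To make the failure mode in the points above concrete, here is a minimal Python simulation of a polling wait loop with and without an interrupt check. It is a sketch, not Postgres source code: `Backend`, `QueryCancelled`, `wait_for_xact`, and the `max_polls` safety cap are all hypothetical names invented for illustration. In Postgres's C code, the role of `check_for_interrupts` is played by the `CHECK_FOR_INTERRUPTS()` macro.

```python
import time


class QueryCancelled(Exception):
    """Raised when the loop notices a pending cancel/terminate request."""


class Backend:
    """Hypothetical stand-in for a database session that can be signalled."""

    def __init__(self):
        # A signal handler would normally only set a flag like this one;
        # the running code has to look at the flag for it to have any effect.
        self.interrupt_pending = False

    def check_for_interrupts(self):
        if self.interrupt_pending:
            raise QueryCancelled("terminating due to pending interrupt")


def wait_for_xact(backend, xact_in_progress, check_interrupts,
                  max_polls=1000, sleep_s=0.0):
    """Poll until xact_in_progress() returns False.

    Returns the number of polls performed. max_polls exists only so the
    demo terminates; the loop it imitates had no such cap.
    """
    polls = 0
    while xact_in_progress():
        if check_interrupts:
            # With this check, a pending signal ends the wait promptly.
            backend.check_for_interrupts()
        polls += 1
        if polls >= max_polls:
            return polls
        time.sleep(sleep_s)
    return polls


if __name__ == "__main__":
    backend = Backend()
    backend.interrupt_pending = True  # pretend SIGTERM has already arrived

    # Without the check, the pending interrupt is never noticed.
    print("no check:", wait_for_xact(backend, lambda: True, False, max_polls=50))

    # With the check, the loop exits on its first iteration.
    try:
        wait_for_xact(backend, lambda: True, True, max_polls=50)
    except QueryCancelled as exc:
        print("with check:", exc)
```

The design point is that delivering a signal only records a request; the waiting code must poll for that request inside every loop that can run for a long time, otherwise the session appears unkillable.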