Hasty Briefs (beta)

When SIGTERM Does Nothing: A Postgres Mystery

10 months ago
  • #Debugging
  • #Postgres
  • #OpenSource
  • The worst bugs are the ones that get ignored at first, only to resurface later and cause far more frustration.
  • ClickPipes encountered a critical bug with logical replication slot creation on Postgres read replicas, leading to unkillable queries.
  • The issue appeared when creating a logical replication slot on a standby: the command waited indefinitely for a transaction on the primary to complete and could not be cancelled or terminated.
  • Investigation traced the bug to a polling loop in Postgres's `XactLockTableWait` function that, on standbys, kept spinning without checking for pending interrupts.
  • A patch was submitted and accepted by the Postgres community, adding interrupt checks to resolve the unkillable query issue.
  • Further improvements, such as better wait event reporting and more efficient waiting mechanisms, are in progress for future Postgres releases.
  • The experience highlights the importance of persistence in debugging and the value of open-source contributions.
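
To make the failure mode in the summary concrete, here is a minimal Python sketch of a polling loop with and without an interrupt check. It is an illustration only, not the Postgres C code: `wait_for_transaction`, `QueryCancelled`, `interrupt_pending`, and `max_polls` are hypothetical names invented for this example, and `max_polls` exists solely so the demonstration terminates. In the Postgres source, interrupt checks of this kind are typically written as the `CHECK_FOR_INTERRUPTS()` macro.

```python
import threading
import time


class QueryCancelled(Exception):
    """Raised when the waiting loop notices a pending cancel request."""


def wait_for_transaction(is_in_progress, interrupt_pending, check_interrupts,
                         poll_interval=0.0001, max_polls=50):
    """Poll until is_in_progress() returns False; return the number of polls.

    Hypothetical stand-in for a sleep-and-recheck wait loop.
    max_polls is a safety cap for this demo only; a loop without it
    (and without interrupt checks) would spin forever.
    """
    polls = 0
    while is_in_progress():
        # The fix described above: look for a pending interrupt on every pass.
        if check_interrupts and interrupt_pending.is_set():
            raise QueryCancelled("cancelled while waiting for transaction")
        if polls >= max_polls:
            return polls  # demo-only exit; stands in for "still stuck"
        time.sleep(poll_interval)
        polls += 1
    return polls


def demo():
    """Compare both loop variants against a transaction that never finishes."""
    interrupt = threading.Event()
    interrupt.set()  # pretend the termination signal has already arrived

    def never_finishes():
        return True

    # Without the check, the pending interrupt is ignored for every poll.
    ignored_polls = wait_for_transaction(never_finishes, interrupt,
                                         check_interrupts=False)
    # With the check, the loop exits on its first pass.
    try:
        wait_for_transaction(never_finishes, interrupt, check_interrupts=True)
        cancelled = False
    except QueryCancelled:
        cancelled = True
    return ignored_polls, cancelled
```

Calling `demo()` shows the contrast: the unchecked loop polls until the demo cap is reached even though an interrupt is pending, while the checked loop raises `QueryCancelled` immediately. The sketch still polls; the "efficient waiting mechanisms" mentioned above would replace the sleep-and-recheck pattern altogether.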