TLA+ Caught a Silent Data Divergence Bug in Postgres's pg_rewind
11 hours ago
- #Postgres
- #Data Integrity
- #High Availability
- Postgres streaming replication enables high availability by replicating write-ahead logs (WAL) to standby servers, allowing promotion when the primary fails.
- pg_rewind solves data divergence by comparing the old primary's data with the new primary's, rewinding the old primary to a common point.
- Timeline IDs (TLI) track promotions, with history files recording lineage, but identical TLI and LSN can mislead pg_rewind.
- A bug occurs when two crashes cause independent promotions to the same TLI, leading pg_rewind to incorrectly skip rewinding.
- This results in silent data divergence, where a standby retains unreplicated writes, risking phantom data reappearance.
- TLA+ modeling identified this bug by exhaustively checking states, confirming the violation of storage consistency invariants.
- The fix introduces promotion UUIDs (UUIDv7) in timeline history files to uniquely identify each promotion.
- UUID comparison in pg_rewind detects independent promotions, forcing rewinds when UUIDs differ, even with matching TLIs.
- Backward compatibility is maintained, with missing UUIDs treated as compatible, ensuring graceful degradation.
- Demonstration shows the bug and fix via a test setup, highlighting how UUIDs prevent silent data loss.