PostgreSQL production incident caused by transaction ID wraparound
5 hours ago
- #Database Maintenance
- #PostgreSQL
- #Transaction ID Wraparound
- PostgreSQL transaction ID wraparound is a silent but severe failure mode that can cause a complete write outage.
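- It occurs when old row versions are not frozen, a job normally done by VACUUM and autovacuum, so the age of the oldest unfrozen transaction ID keeps climbing toward a hard safety limit.
- The failure is not immediate; it can surface after months or years of normal operation, with few obvious warning signs unless transaction ID age is being monitored.
- In the incident described, autovacuum had been disabled on tables, including unused ones. Even an idle table ages, because transaction ID age is measured against the cluster-wide counter rather than against activity on that table.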
- Once the limit is reached, PostgreSQL intentionally blocks all writes to prevent data corruption, making the database read-only.
- Recovery requires manual VACUUM FREEZE operations, which freeze old row versions and advance each table's frozen-XID horizon so that transaction ID age drops back out of the danger zone.
- Transaction ID consumption is the product of write rate and elapsed time; with no freezing, even a modest workload can approach the roughly 2-billion-transaction limit in about 7.5 months.
- Unlike PostgreSQL, SQL Server avoids this issue by using Log Sequence Numbers (LSNs) instead of finite, reusable transaction IDs.
- Understanding transaction ID math and treating autovacuum as a safety mechanism is crucial for PostgreSQL production stability.
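
The transaction ID math mentioned above can be sketched in a few lines of Python. This is a rough illustration, not a reproduction of the incident's numbers: the consumption rates are assumed values, the function names are hypothetical, and the model assumes a constant rate with no freezing at all. The only fixed input is that PostgreSQL transaction IDs are 32-bit, so about 2^31 (roughly 2.1 billion) transactions can separate the oldest unfrozen row from the current counter.

```python
# Back-of-the-envelope estimate of time to transaction ID wraparound.
# Assumptions (not from the incident): a constant XID consumption rate
# and no freezing whatsoever during the whole period.

WRAPAROUND_HORIZON = 2**31      # ~2.1 billion XIDs can be "in the past"
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30.44          # average calendar month


def days_until_horizon(xids_per_second: float, current_age: int = 0) -> float:
    """Days until the oldest unfrozen XID is WRAPAROUND_HORIZON transactions old.

    xids_per_second counts only transactions that consume an XID
    (purely read-only transactions do not).
    current_age is the present XID age of the oldest unfrozen row.
    """
    if xids_per_second <= 0:
        raise ValueError("xids_per_second must be positive")
    remaining = max(WRAPAROUND_HORIZON - current_age, 0)
    return remaining / xids_per_second / SECONDS_PER_DAY


def fraction_of_horizon_used(current_age: int) -> float:
    """Share of the available XID headroom already consumed (0.0 to 1.0+)."""
    return current_age / WRAPAROUND_HORIZON


if __name__ == "__main__":
    # Hypothetical sustained write rates, in XID-consuming transactions per second.
    for rate in (10, 100, 1000):
        days = days_until_horizon(rate)
        print(f"{rate:>5} XIDs/s: {days:7.0f} days (~{days / DAYS_PER_MONTH:.1f} months)")
```

Under these assumptions, a sustained rate on the order of 100 XID-consuming transactions per second lands in the same ballpark as the 7.5-month figure, while ten times that rate shrinks the window to a few weeks. In practice PostgreSQL stops accepting writes slightly before the full horizon is consumed, so real headroom is a little smaller than this estimate.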