Postgres Postmaster does not scale
3 months ago
- #Performance
- #Postgres
- #Scalability
- Recall.ai processes millions of meetings weekly, automating tasks like meeting notes and CRM updates.
- Meetings tend to start in sync, on the hour or half-hour, which concentrates load on the media processing infrastructure.
- High load spikes from meeting starts require immediate compute capacity to avoid data loss.
- Postgres's postmaster process, a single-threaded loop, became a bottleneck at high connection rates.
- Delays of 10-15 s in establishing Postgres connections were traced to CPU saturation in the postmaster.
- Investigations revealed the postmaster's fork operations were expensive, especially under high churn.
- Enabling huge pages in Linux reduced page table entry (PTE) overhead, increasing connection throughput by 20%.
- Background workers for parallel queries added stress to the postmaster, exacerbating delays.
- Mitigations included adding jitter to EC2 instance startups and reducing bursts of parallel queries.
- The postmaster's single-threaded nature is a fundamental bottleneck in high-scale Postgres deployments.
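
The startup jitter mentioned above can be illustrated with a small simulation. This is a minimal sketch, not the setup described in the post: the function names `jittered_delay` and `peak_per_second`, the 30-second jitter window, and the count of 1000 instances are all assumptions chosen for illustration. It shows how a uniformly random delay before each instance's first connection lowers the peak number of connection attempts the postmaster sees in any one second.

```python
import random
from collections import Counter
from typing import Iterable, Optional


def jittered_delay(max_jitter_s: float, rng: Optional[random.Random] = None) -> float:
    """Return a uniformly random startup delay in [0, max_jitter_s] seconds."""
    if max_jitter_s < 0:
        raise ValueError("max_jitter_s must be non-negative")
    source = rng if rng is not None else random
    return source.uniform(0.0, max_jitter_s)


def peak_per_second(connect_times_s: Iterable[float]) -> int:
    """Count connection attempts per one-second bucket and return the largest count."""
    buckets = Counter(int(t) for t in connect_times_s)
    return max(buckets.values()) if buckets else 0


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only so the simulation is repeatable
    instances = 1000         # hypothetical fleet size

    # Without jitter, every instance connects at the same moment.
    no_jitter = [0.0] * instances
    # With jitter, each instance waits a random amount before connecting.
    with_jitter = [jittered_delay(30.0, rng) for _ in range(instances)]

    print("peak connections/s without jitter:", peak_per_second(no_jitter))
    print("peak connections/s with jitter:   ", peak_per_second(with_jitter))
```

In a real deployment the delay would be applied as a sleep at instance startup, before the first database connection is opened. The trade-off is that a wider jitter window flattens the spike further but delays how soon capacity becomes available.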