We Uncovered a Race Condition in Aurora RDS

8 days ago

The AWS outage on October 20th was caused by a race condition bug in a DNS management service.
Hightouch Events system architecture relies on Kubernetes, Kafka, and Postgres.
During the AWS outage, services had trouble connecting to Kafka and autoscaling EC2 nodes, and they also hit AWS STS errors.
Postgres queues at Hightouch handle ~1M syncs/day and scale to 500K events per second.
A planned Aurora RDS upgrade on October 23rd ran into a second, separate race condition bug.
Aurora's architecture separates compute from storage, enabling fast failovers but introducing unique failure modes.
The upgrade plan was to add a read replica, upgrade the instances, and then trigger a failover.
Failover attempts failed with the original writer remaining primary despite AWS showing a healthy cluster.
The investigation revealed a race condition during failover that caused both instances to crash.
AWS confirmed the bug was due to an internal signaling issue in the demotion process.
The mitigation was to pause writers before intentional failovers and to update internal playbooks accordingly.
Key takeaways: prepare for worst-case scenarios, prioritize observability, and isolate system impacts.
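
The mitigation and the failed-failover symptom above can be sketched in code. This is a minimal illustration, not the procedure Hightouch or AWS actually uses: `WriterGate`, `current_writer`, and `planned_failover` are hypothetical names, and the cluster dictionaries are assumed to follow the shape of one entry in boto3's `describe_db_clusters` response (`DBClusterMembers` with `DBInstanceIdentifier` and `IsClusterWriter`). The idea is to stop and drain application writes, trigger the failover, and then confirm that the writer role really moved instead of trusting a "healthy" cluster status.

```python
import threading
import time
from contextlib import contextmanager


class WriterGate:
    """Hypothetical application-side gate that can pause and drain writes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._paused = False
        self._in_flight = 0

    @contextmanager
    def write(self, timeout=None):
        """Wrap every database write; waits (up to timeout) while paused."""
        with self._cond:
            if not self._cond.wait_for(lambda: not self._paused, timeout):
                raise TimeoutError("writes are paused for a planned failover")
            self._in_flight += 1
        try:
            yield
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def pause(self, drain_timeout=None):
        """Block new writes; return True once in-flight writes have drained."""
        with self._cond:
            self._paused = True
            return self._cond.wait_for(lambda: self._in_flight == 0, drain_timeout)

    def resume(self):
        with self._cond:
            self._paused = False
            self._cond.notify_all()


def current_writer(cluster):
    """Return the writer instance id, or None if there is not exactly one."""
    writers = [
        m["DBInstanceIdentifier"]
        for m in cluster["DBClusterMembers"]
        if m["IsClusterWriter"]
    ]
    return writers[0] if len(writers) == 1 else None


def planned_failover(gate, describe_cluster, trigger_failover,
                     attempts=30, delay=1.0, sleep=time.sleep):
    """Pause writers, fail over, and verify that the writer role moved.

    describe_cluster and trigger_failover are injected callables (assumed
    here) so the sketch does not depend on a live AWS account.
    Returns the new writer id, or None if the original writer stayed primary.
    """
    old = current_writer(describe_cluster())
    if not gate.pause(drain_timeout=30):
        gate.resume()
        raise RuntimeError("in-flight writes did not drain; aborting failover")
    try:
        trigger_failover()
        for _ in range(attempts):
            new = current_writer(describe_cluster())
            if new is not None and new != old:
                return new
            sleep(delay)
        return None
    finally:
        gate.resume()  # always let writers continue, even on failure
```

With boto3, `describe_cluster` could wrap `rds.describe_db_clusters(DBClusterIdentifier=...)["DBClusters"][0]` and `trigger_failover` could wrap `rds.failover_db_cluster(...)`. A `None` result corresponds to the symptom described above, where the original writer remained primary, and is a signal to stop and investigate before retrying.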
