We Uncovered a Race Condition in Aurora RDS
8 days ago
- The AWS outage on October 20th was caused by a race condition bug in a DNS management service.
- Hightouch Events system architecture relies on Kubernetes, Kafka, and Postgres.
- During the AWS outage, services faced issues connecting to Kafka, autoscaling EC2 nodes, and AWS STS errors.
- Postgres queues at Hightouch handle ~1M syncs/day and scale to 500K events per second.
- A planned Aurora RDS upgrade on October 23rd encountered another race condition bug.
- Aurora's architecture separates compute from storage, enabling fast failovers but introducing unique failure modes.
- The upgrade plan involved adding a read replica, upgrading instances, and triggering a failover.
- Failover attempts failed: the original writer remained primary even though AWS showed a healthy cluster.
- Investigation revealed a race condition during failover, causing both instances to crash.
- AWS confirmed the bug was due to an internal signaling issue in the demotion process.
- Mitigation involved pausing writers before intentional failovers and updating internal playbooks.
- Key takeaways: prepare for worst-case scenarios, prioritize observability, and isolate system impacts.
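
The failed failover above left the original writer as primary while the cluster still reported healthy, so a playbook check should compare writer roles rather than trust cluster status. The following is an added example, not code from Hightouch or AWS; the instance and cluster names are constructed, and the dictionaries mimic the shape of `aws rds describe-db-clusters` output.

```python
def sample_response(writer, status="available"):
    """Constructed stand-in for a describe-db-clusters response."""
    members = ["example-instance-1", "example-instance-2"]  # hypothetical names
    return {"DBClusters": [{
        "DBClusterIdentifier": "example-cluster",
        "Status": status,
        "DBClusterMembers": [
            {"DBInstanceIdentifier": m, "IsClusterWriter": m == writer}
            for m in members
        ],
    }]}


def writers(response):
    """Return the identifiers of every member flagged as the cluster writer."""
    cluster = response["DBClusters"][0]
    return [m["DBInstanceIdentifier"]
            for m in cluster["DBClusterMembers"] if m["IsClusterWriter"]]


def check_failover(before, after, target):
    """Compare snapshots taken before and after a failover request.

    Returns a list of findings; an empty list means the writer role
    actually moved to `target` and the cluster is available.
    """
    findings = []
    old, new = writers(before), writers(after)
    status = after["DBClusters"][0]["Status"]
    if len(new) != 1:
        findings.append(f"expected exactly one writer, found {new}")
    elif new == old:
        # The case from the incident: status looks fine, role never moved.
        findings.append(
            f"writer unchanged ({new[0]}) although cluster status is '{status}'")
    elif new[0] != target:
        findings.append(f"writer moved to {new[0]}, not requested target {target}")
    if status != "available":
        findings.append(f"cluster status is '{status}'")
    return findings


if __name__ == "__main__":
    before = sample_response("example-instance-1")
    stuck = sample_response("example-instance-1")   # failover silently failed
    moved = sample_response("example-instance-2")   # failover succeeded
    print(check_failover(before, stuck, "example-instance-2"))
    print(check_failover(before, moved, "example-instance-2"))
```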