Hasty Briefs (beta)

We Uncovered a Race Condition in Aurora RDS

8 days ago
  • #Race Condition
  • #AWS Outage
  • #Aurora RDS
  • AWS outage on October 20th caused by a race condition bug in a DNS management service.
  • Hightouch Events system architecture relies on Kubernetes, Kafka, and Postgres.
  • During the AWS outage, services had trouble connecting to Kafka and autoscaling EC2 nodes, and hit AWS STS errors.
  • Postgres queues at Hightouch handle ~1M syncs/day and scale to 500K events per second.
  • Planned Aurora RDS upgrade on October 23rd encountered another race condition bug.
  • Aurora's architecture separates compute from storage, enabling fast failovers but introducing unique failure modes.
  • Upgrade plan involved adding a read replica, upgrading instances, and triggering a failover.
  • Failover attempts failed: the original writer remained primary even though AWS reported a healthy cluster.
  • Investigation revealed a race condition during failover that caused both instances to crash.
  • AWS confirmed the bug was due to an internal signaling issue in the demotion process.
  • Mitigation involved pausing writers before intentional failovers and updating internal playbooks.
  • Key takeaways: prepare for worst-case scenarios, prioritize observability, and isolate system impacts.
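
The mitigation described above (pause writers before an intentional failover, then confirm the writer role actually moved rather than trusting a healthy-looking cluster status) could be sketched roughly as follows. This is a minimal illustration, not Hightouch's actual playbook. It assumes `rds` is an object with the boto3 RDS client methods `failover_db_cluster` and `describe_db_clusters`. The `pause` and `resume` callables are hypothetical hooks for whatever stops and restarts application writes, and the timeout values are arbitrary.

```python
import time
from contextlib import contextmanager


def current_writer(rds, cluster_id):
    """Return the instance identifier the cluster currently reports as writer."""
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    for member in cluster["DBClusterMembers"]:
        if member["IsClusterWriter"]:
            return member["DBInstanceIdentifier"]
    return None


@contextmanager
def writers_paused(pause, resume):
    """Hypothetical hook: pause()/resume() stop and restart application writes
    (for example, by scaling queue workers down and back up)."""
    pause()
    try:
        yield
    finally:
        # Always resume, even if the failover call raises or times out.
        resume()


def failover_and_verify(rds, cluster_id, target_id, pause, resume,
                        timeout_s=300.0, poll_s=5.0, sleep=time.sleep):
    """Pause writers, request a failover, and confirm the writer role moved.

    Returns True if target_id became the writer within timeout_s, else False.
    """
    with writers_paused(pause, resume):
        rds.failover_db_cluster(
            DBClusterIdentifier=cluster_id,
            TargetDBInstanceIdentifier=target_id,
        )
        waited = 0.0
        while waited <= timeout_s:
            # Check the reported writer directly instead of relying on
            # the cluster status alone.
            if current_writer(rds, cluster_id) == target_id:
                return True
            sleep(poll_s)
            waited += poll_s
    return False
```

In real use, `rds` would typically be `boto3.client("rds")`. A `False` return corresponds to the failure mode in the summary, where the original writer remained primary after the failover attempt.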