How when AWS was down, we were not
5 days ago
- #Reliability
- #Auth
- #AWS
- AWS us-east-1 experienced a major outage on October 20th, impacting DynamoDB DNS and causing widespread service disruptions.
- Authress maintains a 99.999% SLA through AWS outages by combining multi-region failover with dynamic DNS routing.
- Critical AWS services such as CloudFront, Certificate Manager, and IAM have their control planes in us-east-1, so a failure in that region can affect workloads running elsewhere.
- Authress uses Route 53 health checks and failover routing to switch regions during outages, ensuring minimal downtime.
- Validation tests and anomaly detection (e.g., Authorization Ratio) help identify and mitigate issues before customers are affected.
- Incremental rollouts and customer deployment buckets reduce the impact of bugs by limiting exposure during deployments.
- Authress employs rate limiting and AWS WAF with managed IP reputation lists to prevent resource exhaustion and block malicious traffic.
- Customer support is integrated directly with engineering to quickly address incidents and reduce resolution time.
- Infrastructure as Code (IaC) becomes harder to manage when the primary, backup, and edge regions each need a slightly different architecture.
- Even with these measures, achieving a five-nines SLA requires continuous improvement and vigilance against new failure modes.
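
To make the 99.999% figure concrete, it helps to convert an availability target into a downtime budget. The sketch below is plain arithmetic and is not taken from Authress; the function name `downtime_budget_minutes` is made up for illustration. At five nines, the budget works out to roughly 5.26 minutes per 365-day year.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime (in minutes) allowed by an availability target over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Compare common availability targets over one year.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows about {budget:.2f} minutes of downtime per year")
```

A budget this small leaves no room for a human to notice an outage and react by hand, which is why the failover described above has to be automatic.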
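
The Route 53 failover routing mentioned above can be sketched as a pair of records: a primary record tied to a health check and a secondary record that Route 53 serves when that check fails. This is a minimal illustration, not Authress's actual configuration. The hostnames, the health check ID, and the helper name `failover_change_batch` are all hypothetical. The dictionary follows the `ChangeBatch` shape accepted by the Route 53 `ChangeResourceRecordSets` API, and the code only builds it without calling AWS.

```python
from typing import Any, Dict, Optional


def failover_change_batch(
    record_name: str,
    primary_target: str,
    secondary_target: str,
    primary_health_check_id: str,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY failover records."""

    def record(role: str, target: str, health_check_id: Optional[str] = None) -> Dict[str, Any]:
        record_set: Dict[str, Any] = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,  # a short TTL lets resolvers pick up a failover quickly
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id is not None:
            # Route 53 stops answering with this record when the check is unhealthy.
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover routing between two regions (illustrative)",
        "Changes": [
            record("PRIMARY", primary_target, primary_health_check_id),
            record("SECONDARY", secondary_target),
        ],
    }


if __name__ == "__main__":
    # All values below are placeholders.
    batch = failover_change_batch(
        record_name="api.example.com.",
        primary_target="us-east-1.api.example.com.",
        secondary_target="us-west-2.api.example.com.",
        primary_health_check_id="hypothetical-health-check-id",
    )
    # Applying it would use boto3, for example:
    # boto3.client("route53").change_resource_record_sets(
    #     HostedZoneId="<hosted zone id>", ChangeBatch=batch)
    print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```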
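
The summary names the Authorization Ratio as an anomaly signal but does not define it. The sketch below assumes it is the share of authorization checks that were allowed within a time window, and flags a window whose ratio drifts too far from a known baseline. The definition, the thresholds, and every name in the code are assumptions made for illustration, not Authress's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WindowCounts:
    """Authorization outcomes observed in one time window (assumed shape)."""
    allowed: int
    denied: int


def authorization_ratio(window: WindowCounts) -> Optional[float]:
    """Return allowed / total for the window, or None if there was no traffic."""
    total = window.allowed + window.denied
    return window.allowed / total if total else None


def is_anomalous(
    window: WindowCounts,
    baseline_ratio: float,
    tolerance: float = 0.10,
    min_requests: int = 100,
) -> bool:
    """Flag the window when its ratio deviates from the baseline by more than tolerance."""
    total = window.allowed + window.denied
    if total < min_requests:
        # Too little traffic to draw a conclusion; avoid noisy alerts.
        return False
    ratio = authorization_ratio(window)
    return ratio is not None and abs(ratio - baseline_ratio) > tolerance


if __name__ == "__main__":
    baseline = 0.90  # hypothetical long-running ratio for one customer
    print(is_anomalous(WindowCounts(allowed=900, denied=100), baseline))  # steady traffic
    print(is_anomalous(WindowCounts(allowed=500, denied=500), baseline))  # sudden spike in denials
```

A ratio is useful here because it stays stable as traffic volume changes, so a bad deployment that starts denying valid requests shows up even when error rates and latency look normal.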
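
One way to picture customer deployment buckets is to assign each customer to a stable bucket and release a new version to one bucket at a time. The sketch below assumes hash-based assignment with ten buckets. The source does not say how Authress assigns customers, so the hashing scheme, the bucket count, and the function names are all hypothetical.

```python
import hashlib


def deployment_bucket(customer_id: str, bucket_count: int = 10) -> int:
    """Map a customer to a stable bucket in the range [0, bucket_count)."""
    if bucket_count <= 0:
        raise ValueError("bucket_count must be positive")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash so assignment is deterministic across processes.
    return int.from_bytes(digest[:8], "big") % bucket_count


def receives_new_version(customer_id: str, buckets_released: int, bucket_count: int = 10) -> bool:
    """Return True if the customer's bucket is among those already released."""
    return deployment_bucket(customer_id, bucket_count) < buckets_released


if __name__ == "__main__":
    customers = ["acme", "globex", "initech"]  # placeholder customer IDs
    for released in (1, 5, 10):
        exposed = [c for c in customers if receives_new_version(c, released)]
        print(f"{released} of 10 buckets released: {len(exposed)} of {len(customers)} customers exposed")
```

Because a customer always lands in the same bucket, a bug caught in the first stage only ever reaches that fraction of customers, and rolling back affects a known, fixed set.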