Cloudflare incident on August 21, 2025
- #Network Congestion
- #Cloudflare
- #AWS
- On August 21, 2025, a traffic surge from a single customer hosted in AWS us-east-1 caused severe congestion on the links between Cloudflare and AWS us-east-1, leading to high latency, packet loss, and connection failures.
- The incident started at 16:27 UTC and was mostly resolved by 19:38 UTC, with intermittent issues until 20:18 UTC.
- The surge overloaded Cloudflare's direct links with AWS us-east-1; AWS's attempt to alleviate the congestion by withdrawing BGP advertisements made it worse, since the same traffic then concentrated onto the remaining paths (a toy illustration of this effect appears in the first sketch below).
- Cloudflare's internal network capacity was insufficient to absorb the surge, partly because of a pre-existing half-capacity link and a data center interconnect (DCI) upgrade that was still pending.
- Cloudflare and AWS collaborated to mitigate the issue, including rate-limiting the problematic customer's traffic and adjusting BGP advertisements (a per-customer rate-limiting sketch follows below).
- The incident highlighted the need for better customer isolation and network capacity to prevent similar issues in the future.
- Short-term remediations include deprioritizing traffic from customers that cause congestion and expediting the pending DCI upgrades.
- Long-term, Cloudflare plans to build a new traffic management system that allocates network resources per customer and automates congestion responses (the last sketch below illustrates the idea).
- Cloudflare apologized for the disruption and is implementing improvements to prevent recurrence.
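
The bullet about BGP withdrawals glosses over why the withdrawal made things worse. The sketch below is a toy model, not anything Cloudflare or AWS actually ran, and the link names, capacities, and demand figures are made up; it only illustrates that withdrawing routes does not reduce demand, it concentrates the same traffic onto whichever links still advertise the prefixes.

```python
# Toy model (hypothetical link names, capacities, and demand): withdrawing
# routes from one congested link pushes its traffic onto the links that
# still advertise the prefixes, raising their utilization.

def utilization(demand_gbps: float, links: dict[str, float]) -> dict[str, float]:
    """Spread a fixed demand evenly across the links that remain advertised."""
    share = demand_gbps / len(links)
    return {name: round(share / capacity, 2) for name, capacity in links.items()}

demand = 900.0                                             # Gbps of customer traffic
links = {"pni-1": 400.0, "pni-2": 400.0, "pni-3": 400.0}   # Gbps capacities

print(utilization(demand, links))   # {'pni-1': 0.75, 'pni-2': 0.75, 'pni-3': 0.75}
links.pop("pni-3")                  # advertisements withdrawn from one link
print(utilization(demand, links))   # {'pni-1': 1.12, 'pni-2': 1.12} -> overloaded
```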
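The joint mitigation included rate-limiting the problematic customer. These notes do not describe the exact mechanism, but a per-customer token bucket is one common way to do it; the class, rates, and customer labels below are purely illustrative.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Classic token bucket: admits up to `rate` bytes/s with bursts up to `burst` bytes."""
    rate: float                                   # refill rate, bytes per second
    burst: float                                  # bucket capacity, bytes
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self.tokens = self.burst                  # start full so normal traffic is unaffected

    def allow(self, nbytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False                              # over the allowance: drop or queue

# One bucket per customer; the customer driving the surge gets a much tighter limit.
limits = {
    "surging-customer": TokenBucket(rate=1e9, burst=5e8),    # illustrative figures
    "typical-customer": TokenBucket(rate=1e10, burst=5e9),
}

def admit(customer: str, nbytes: int) -> bool:
    return limits[customer].allow(nbytes)
```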
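The longer-term plan is a traffic management system that allocates resources per customer and reacts to congestion automatically. As a rough sketch of what an automated response could look like (the threshold, link size, and customer names are assumptions, not Cloudflare's design): when a link crosses a utilization threshold, pick the heaviest contributors and deprioritize only their traffic.

```python
# Rough sketch of an automated congestion response (all thresholds and figures
# are assumptions): when a link is nearly full, deprioritize the heaviest
# customers instead of letting their traffic degrade everyone sharing the link.

CONGESTION_THRESHOLD = 0.90          # assumed utilization trigger

def congestion_response(link_capacity_gbps: float,
                        per_customer_gbps: dict[str, float],
                        max_deprioritized: int = 1) -> list[str]:
    """Return the customers whose traffic should be moved to a lower-priority class."""
    utilization = sum(per_customer_gbps.values()) / link_capacity_gbps
    if utilization < CONGESTION_THRESHOLD:
        return []
    heaviest = sorted(per_customer_gbps, key=per_customer_gbps.get, reverse=True)
    return heaviest[:max_deprioritized]

# Example: one customer dominates a 400 Gbps link, so only it is deprioritized.
print(congestion_response(400.0, {"cust-a": 320.0, "cust-b": 60.0, "cust-c": 30.0}))
# -> ['cust-a']
```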