Code Orange: Fail Small is complete. The result is a stronger Cloudflare network
10 hours ago
- #Cloudflare
- #Infrastructure Resilience
- #Incident Management
- Cloudflare completed an engineering initiative called 'Code Orange: Fail Small' to enhance infrastructure resilience, security, and reliability following past global outages.
- Made configuration changes safer through 'health-mediated deployment' via Snapstone: changes roll out gradually under real-time monitoring and roll back automatically before they can cause widespread impact.
- Reduced the impact of failures: systems fall back to their last known good configuration, fail open or closed as appropriate, and services are segmented by customer cohort to limit blast radius.
- Revised 'break glass' and incident management procedures, adding backup authorization pathways and conducting drills to improve response and communication during outages.
- Codified the improvements in an internal Codex whose rules are enforced through AI reviews, institutionalizing lessons learned to prevent regressions and keep best practices in place.
- Strengthened customer communication during outages, committing to timely alerts, regular updates, and detailed post-mortems for transparency and planning.
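
The rollout pattern summarized above (gradual deployment, health checks, automatic rollback to a last known good configuration) can be illustrated with a minimal Python sketch. This is not Snapstone's interface, which the summary does not describe; `staged_rollout`, the cohort names, the in-memory `fleet`, and the simulated health signal are all hypothetical and exist only to show the control flow.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]


def staged_rollout(
    cohorts: List[str],
    new_config: Config,
    last_known_good: Config,
    apply: Callable[[str, Config], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Apply new_config one cohort at a time; revert on the first unhealthy cohort."""
    updated: List[str] = []
    for cohort in cohorts:
        apply(cohort, new_config)
        updated.append(cohort)
        if not is_healthy(cohort):
            # Roll back every cohort touched so far, newest first.
            for touched in reversed(updated):
                apply(touched, last_known_good)
            return False
    return True


# Hypothetical in-memory fleet: three cohorts, all starting on version v1.
fleet: Dict[str, Config] = {
    name: {"version": "v1"} for name in ("canary", "small", "large")
}
history: List[Tuple[str, str]] = []


def apply(cohort: str, config: Config) -> None:
    fleet[cohort] = dict(config)
    history.append((cohort, config["version"]))


def is_healthy(cohort: str) -> bool:
    # Simulated health signal (an assumption): v2 breaks the "small" cohort only.
    return not (cohort == "small" and fleet[cohort]["version"] == "v2")


ok = staged_rollout(
    ["canary", "small", "large"],
    {"version": "v2"},
    {"version": "v1"},
    apply,
    is_healthy,
)
print(ok, fleet)
```

In this sketch the failure surfaces in the second cohort, so the change is reverted there and on the canary, and the largest cohort never receives it. That is the sense in which ordering cohorts from small to large limits blast radius.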