Hasty Briefs

Code Orange: Fail Small is complete. The result is a stronger Cloudflare network

12 hours ago
  Tags: #Cloudflare, #Infrastructure Resilience, #Incident Management
  • Cloudflare completed an engineering initiative called 'Code Orange: Fail Small' to enhance infrastructure resilience, security, and reliability following past global outages.
  • Made configuration changes safer with 'health-mediated deployment' via Snapstone: changes roll out gradually under real-time monitoring and are rolled back automatically to prevent widespread impact.
  • Reduced the impact of failures: systems fall back to the last known good configuration, fail open or closed as appropriate, and services are segmented by customer cohort to limit the blast radius.
  • Revised 'break glass' and incident management procedures, adding backup authorization pathways and conducting drills to improve response and communication during outages.
  • Codified the improvements in an internal Codex whose rules are enforced through AI reviews, institutionalizing the lessons learned so that regressions are prevented and best practices are followed.
  • Strengthened customer communication during outages, committing to timely alerts, regular updates, and detailed post-mortems for transparency and planning.
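
The rollout and fallback ideas summarized above (gradual rollout by cohort, health checks, automatic rollback to a last known good configuration) can be sketched roughly as follows. This is a minimal illustration, not Snapstone's actual design or API, which the summary does not describe. The names `Fleet`, `staged_rollout`, and `is_healthy`, and the cohort labels, are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Fleet:
    """Hypothetical fleet: maps each cohort name to the config version it runs."""
    active: Dict[str, str]
    last_known_good: Dict[str, str] = field(default_factory=dict)

    def apply(self, cohort: str, version: str) -> None:
        # Remember what the cohort was running so it can be restored later.
        self.last_known_good[cohort] = self.active[cohort]
        self.active[cohort] = version

    def rollback(self, cohort: str) -> None:
        # Restore the version recorded before the most recent apply().
        self.active[cohort] = self.last_known_good[cohort]


def staged_rollout(
    fleet: Fleet,
    version: str,
    stages: List[str],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Roll `version` out one cohort at a time; revert on the first bad signal.

    `is_healthy` stands in for whatever real-time monitoring is available
    (an assumption for this sketch). Returns True if every stage passed.
    """
    updated: List[str] = []
    for cohort in stages:
        fleet.apply(cohort, version)
        updated.append(cohort)
        if not is_healthy(cohort):
            # Automatic rollback: restore every cohort touched so far,
            # and never reach the remaining (larger) cohorts.
            for done in reversed(updated):
                fleet.rollback(done)
            return False
    return True
```

For example, with cohorts `["canary", "free", "enterprise"]` and a health check that fails on `"free"`, the function returns `False`, `"canary"` and `"free"` are restored to their previous version, and `"enterprise"` is never touched. Ordering the stages from smallest to largest cohort is what keeps the blast radius of a bad change small.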