Cloudflare's approach to global service health metrics and software releases

a year ago

Cloudflare's Health Mediated Deployments (HMD) system automates software rollouts, using service health metrics to stop a bad release before it causes widespread issues.
HMD uses Prometheus and Thanos to monitor service health, reverting problematic updates before they affect users.
Service Level Objectives (SLOs) and Service Level Indicators (SLIs) help HMD detect and respond to performance degradation.
Backtesting against historical incident data checks that HMD would have reacted quickly to past incidents, and so is likely to react quickly to similar future ones.
Cloudflare stores 4.5 billion time series in R2, totaling 8 petabytes of data for a year of retention.
Optimizations like recording rules and distributed query processing reduce batch runtimes from 30 hours to 2 hours.
Congestion control mechanisms prioritize critical queries and smooth out batch processing spikes.
Experimental Parquet-based storage shows promise for optimizing time series data handling.
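
To make the SLO and SLI point above concrete, here is a minimal sketch of a health gate that decides whether a rollout should be reverted. It is not Cloudflare's implementation: the names `SLO`, `sli`, `burn_rate`, and `should_revert`, the burn-rate formulation, and the default threshold of 10 are all assumptions chosen for illustration. In a real setup the good and total counts would come from a metrics query rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical objective: the fraction of events that must be good."""
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def __post_init__(self) -> None:
        # A target of 1.0 leaves no error budget, so burn rate is undefined.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def sli(good: int, total: int) -> float:
    """Service Level Indicator: fraction of good events in the window."""
    if total == 0:
        return 1.0  # assumption: no traffic is treated as healthy
    return good / total


def burn_rate(slo: SLO, good: int, total: int) -> float:
    """Observed error rate divided by the error budget.

    1.0 means the budget is being spent exactly at the allowed pace;
    larger values mean it is being spent faster.
    """
    error_budget = 1.0 - slo.target
    observed_error = 1.0 - sli(good, total)
    return observed_error / error_budget


def should_revert(slo: SLO, good: int, total: int,
                  max_burn_rate: float = 10.0) -> bool:
    """Illustrative gate: revert when the burn rate exceeds the threshold.

    The default threshold is an arbitrary value for this sketch.
    """
    return burn_rate(slo, good, total) > max_burn_rate


if __name__ == "__main__":
    availability = SLO(target=0.999)
    # 200 failures out of 10,000 requests since the release started.
    print(should_revert(availability, good=9_800, total=10_000))
```

Expressing the gate as a burn rate rather than a raw error rate lets the same threshold apply to services with different objectives, since each service's error rate is measured against its own budget.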
