Cloudflare's approach to global service health metrics and software releases
a year ago
- #Cloudflare
- #DevOps
- #Observability
- Cloudflare's Health Mediated Deployments (HMD) automates software updates, using service health metrics to stop a bad release before it causes widespread issues.
- HMD uses Prometheus and Thanos to monitor service health, reverting problematic updates before they affect users.
- Service Level Objectives (SLOs) and Service Level Indicators (SLIs) help HMD detect and respond to performance degradation.
- Backtesting against historical incident data validates that HMD would have caught past incidents quickly, giving confidence in its response to future ones.
- Cloudflare stores 4.5 billion time series in R2, totaling 8 petabytes of data for a year of retention.
- Optimizations like recording rules and distributed query processing reduce batch runtimes from 30 hours to 2 hours.
- Congestion control mechanisms prioritize critical queries and smooth out batch processing spikes.
- Experimental Parquet-based storage shows promise for optimizing time series data handling.
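
To make the SLO and SLI points above concrete, here is a minimal sketch of a health gate that decides whether a release should be reverted. It is not Cloudflare's implementation: the names `WindowCounts`, `burn_rate`, and `should_revert`, the 99.9% target, and the threshold of 10 are all assumptions chosen for illustration. It uses the common two-window burn-rate approach, where the SLI is the fraction of good requests and the burn rate is the observed error rate divided by the error budget the SLO allows.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts observed for the new release over one time window."""
    good: int
    total: int


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive exactly at the rate the SLO allows.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if counts.total == 0:
        return 0.0  # no traffic yet, so no evidence of harm
    error_rate = 1.0 - counts.good / counts.total  # 1 - SLI
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_revert(
    short: WindowCounts,
    long: WindowCounts,
    slo_target: float = 0.999,  # hypothetical SLO, not from the source
    threshold: float = 10.0,    # hypothetical burn-rate limit
) -> bool:
    """Revert only if both windows burn budget faster than the threshold."""
    return (
        burn_rate(short, slo_target) > threshold
        and burn_rate(long, slo_target) > threshold
    )


if __name__ == "__main__":
    healthy = WindowCounts(good=9995, total=10000)
    degraded = WindowCounts(good=9800, total=10000)
    print(should_revert(short=degraded, long=degraded))
    print(should_revert(short=degraded, long=healthy))
```

Requiring both a short and a long window to exceed the threshold is a design choice that trades a little detection speed for fewer false reverts on brief blips. In a real system the counts would come from queries against the metrics store rather than hard-coded values.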