Cloudflare's approach to global service health metrics and software releases

a year ago
  • #Cloudflare
  • #DevOps
  • #Observability
  • Cloudflare's Health Mediated Deployments (HMD) automates software updates using data-driven metrics to prevent widespread issues.
  • HMD uses Prometheus and Thanos to monitor service health, reverting problematic updates before they affect users.
  • Service Level Objectives (SLOs) and Service Level Indicators (SLIs) let HMD detect and respond to performance degradation.
  • Backtesting against historical incident data checks that HMD reacts quickly enough to catch similar issues in the future.
  • Cloudflare stores 4.5 billion time series in R2, totaling 8 petabytes of data for a year of retention.
  • Optimizations like recording rules and distributed query processing reduce batch runtimes from 30 hours to 2 hours.
  • Congestion control mechanisms prioritize critical queries and smooth out batch processing spikes.
  • Experimental Parquet-based storage shows promise for optimizing time series data handling.
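
The SLO-based health check described in the bullets above can be illustrated with a small sketch. This is not Cloudflare's implementation: the `SLO` class, the `burn_rate` and `decide` functions, and the `max_burn` threshold of 10 are all hypothetical names and values chosen for illustration. The sketch assumes an availability-style SLI (good requests over total requests) and compares the observed error rate with the error rate the SLO allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO: the fraction of requests that must succeed."""
    target: float  # e.g. 0.999 means 99.9% of requests should be good

    def __post_init__(self) -> None:
        # A target of 1.0 would leave no error budget to divide by.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def burn_rate(good: int, total: int, slo: SLO) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly as fast
    as allowed; values above 1.0 mean it is being spent faster.
    """
    if total == 0:
        return 0.0  # no traffic observed, so no evidence of degradation
    error_rate = 1.0 - good / total
    allowed = 1.0 - slo.target
    return error_rate / allowed


def decide(good: int, total: int, slo: SLO, max_burn: float = 10.0) -> str:
    """Return "revert" if the release burns budget too fast, else "proceed".

    max_burn is an assumed threshold, not a value from the source.
    """
    return "revert" if burn_rate(good, total, slo) > max_burn else "proceed"


slo = SLO(target=0.999)
print(decide(good=99_950, total=100_000, slo=slo))  # burn rate is about 0.5
print(decide(good=98_000, total=100_000, slo=slo))  # burn rate is about 20
```

In a real pipeline the `good` and `total` counts would presumably come from queries against the metrics store (Prometheus and Thanos in the source), and the decision would be evaluated repeatedly as the release progresses rather than once.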