Hasty Briefs (beta)

We should get rid of average CPU utilization

12 hours ago
  • #Cloud Native
  • #Performance Monitoring
  • #CPU Throttling
  • Average CPU utilization is misleading for latency-sensitive applications because wait times rise sharply, not linearly, as utilization climbs.
  • CPU throttling via cgroups can cause starvation even when average CPU usage appears low, especially with bursty workloads.
  • Key metrics to monitor include cgroup throttling (nr_throttled, throttled_usec), kernel PSI (cpu.pressure), hypervisor steal time (%st), and application-level starvation signals.
  • Detecting CPU starvation from inside the application is crucial for maintaining performance; systems such as Redpanda and CockroachDB do this.
  • Compliance requirements often mandate CPU limits, but relying solely on average CPU graphs can hide critical performance issues.
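
The throttling and pressure metrics named above can be read directly from the cgroup v2 filesystem. The following is a minimal sketch, assuming cgroup v2 with the standard `cpu.stat` and `cpu.pressure` file formats; the helper names (`parse_cpu_stat`, `parse_pressure`, `throttle_ratio`, `read_cgroup_file`) are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the flat 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def parse_pressure(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI lines such as 'some avg10=1.50 avg60=0.40 avg300=0.10 total=12345'."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # First field is the line kind ('some' or 'full'); the rest are key=value pairs.
        pairs = (field.split("=", 1) for field in fields[1:])
        result[fields[0]] = {key: float(value) for key, value in pairs}
    return result


def throttle_ratio(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of enforcement periods that were throttled between two cpu.stat samples."""
    periods = after["nr_periods"] - before["nr_periods"]
    if periods <= 0:
        return 0.0
    return (after["nr_throttled"] - before["nr_throttled"]) / periods


def read_cgroup_file(cgroup_dir: str, name: str) -> str:
    """Read one control file; cgroup_dir is an assumed path such as /sys/fs/cgroup/<group>."""
    return (Path(cgroup_dir) / name).read_text()
```

Sampling `cpu.stat` twice and passing both samples to `throttle_ratio` gives the share of periods in which the group hit its limit over that interval. That share can be high even when the average utilization graph looks calm.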
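
To illustrate why high utilization inflates wait times, here is a small sketch using the textbook M/M/1 queue. The queueing model is an assumption made for illustration, not something taken from the summarized article. In that model the mean response time is the service time multiplied by 1 / (1 - utilization). The function name `mm1_slowdown` is hypothetical.

```python
def mm1_slowdown(utilization: float) -> float:
    """Mean time in system divided by mean service time for an M/M/1 queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


if __name__ == "__main__":
    for rho in (0.5, 0.8, 0.9, 0.95):
        factor = mm1_slowdown(rho)
        print(f"utilization {rho:.0%}: response time is about {factor:.1f}x the service time")
```

Under this simplified model, going from 50% to 90% utilization raises the mean response time from about 2x to about 10x the service time. Real workloads are burstier than the model assumes, so treat the numbers as a rough intuition rather than a prediction.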