We should get rid of average CPU utilization
12 hours ago
- #Cloud Native
- #Performance Monitoring
- #CPU Throttling
- Average CPU utilization is misleading for latency-sensitive applications because queueing delay grows nonlinearly with utilization, so wait times climb sharply well before the CPU looks saturated.
- CPU throttling via cgroups can cause starvation even when average CPU usage appears low, especially with bursty workloads that exhaust their quota within a single enforcement period.
- Key metrics to monitor include cgroup throttling counters (nr_throttled and throttled_usec in cpu.stat), kernel PSI (cpu.pressure), hypervisor steal time (%st), and application-level starvation signals.
- Application-side detection of CPU starvation is crucial for keeping latency predictable, as seen in systems like Redpanda and CockroachDB.
- Compliance requirements often mandate CPU limits, but relying solely on average CPU graphs can hide critical performance issues.
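
The throttling and pressure metrics listed above can be read directly from the cgroup filesystem. The following is a minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup; the function names, the 10-second sampling window, and the choice of cgroup directory are illustrative assumptions, not something the original post specifies.

```python
import time
from pathlib import Path


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse cgroup v2 cpu.stat content ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def parse_pressure(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI lines such as 'some avg10=1.50 avg60=0.80 avg300=0.20 total=12345'."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        pairs = (field.split("=", 1) for field in fields[1:])
        result[fields[0]] = {key: float(value) for key, value in pairs}
    return result


def throttle_ratio(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of enforcement periods between two samples in which the cgroup was throttled."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0


if __name__ == "__main__":
    # Assumption: cgroup v2 root; point this at the container's own cgroup directory.
    cgroup = Path("/sys/fs/cgroup")
    stat_file = cgroup / "cpu.stat"
    pressure_file = cgroup / "cpu.pressure"

    if stat_file.exists():
        first = parse_cpu_stat(stat_file.read_text())
        time.sleep(10)  # illustrative sampling window
        second = parse_cpu_stat(stat_file.read_text())
        # nr_periods/nr_throttled only appear when the CPU controller is enabled.
        if "nr_periods" in first and "nr_periods" in second:
            print(f"throttled periods: {throttle_ratio(first, second):.1%}")
            waited = second["throttled_usec"] - first["throttled_usec"]
            print(f"time spent throttled: {waited} usec")

    if pressure_file.exists():
        psi = parse_pressure(pressure_file.read_text())
        print(f"cpu pressure (some, avg10): {psi['some']['avg10']}%")
```

Sampling the counters twice and taking the difference matters because nr_throttled and throttled_usec are cumulative: a nonzero throttle ratio alongside a low average utilization graph is the mismatch the summary warns about.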