When etcd crashes, check your disks first
4 days ago
- #etcd
- #debugging
- #Kubernetes
- The etcd crashes in a cloud-edge continuum testbed were caused by high storage I/O latency.
- The demo involved Karmada orchestrating k3s clusters across a NUC, Raspberry Pi, and Jetson AGX Orin for real-time object detection.
- Karmada pods crashed periodically because of etcd timeouts, which were traced back to inconsistent I/O performance on shared VM storage.
- ZFS tuning (disabling sync writes, enabling LZ4 compression, disabling atime, setting recordsize to 8K) resolved the etcd stability issues.
- Key lesson: When etcd crashes, first investigate disk I/O performance, especially in shared or non-dedicated storage environments.
- The demo successfully showcased adaptive, policy-driven orchestration that moved workloads from the Raspberry Pi to the Jetson AGX Orin based on telemetry.
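
To make the "check your disks first" advice concrete, here is a minimal Python sketch, not the procedure used in the original debugging session. It times small synchronous writes in a directory, which roughly approximates the pattern of etcd's write-ahead log, and it builds the `zfs set` commands for the four properties listed above. The function names, the 10 ms threshold (taken from etcd's general guidance for 99th-percentile WAL fsync duration), and the dataset name `tank/etcd` are assumptions for illustration. etcd itself calls `fdatasync`; `os.fsync` is used here for portability.

```python
import math
import os
import tempfile
import time


def measure_fsync_latencies(directory, writes=200, payload_size=2048):
    """Append small records to a scratch file, fsync after each one,
    and return the per-fsync latencies in seconds."""
    payload = b"\0" * payload_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # etcd uses fdatasync; fsync is a portable stand-in
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values, fraction):
    """Nearest-rank percentile, e.g. fraction=0.99 for p99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def zfs_tuning_commands(dataset):
    """Build (but do not run) the zfs commands for the tuning in the summary.
    The dataset name is supplied by the caller and is hypothetical here."""
    return [
        ["zfs", "set", "sync=disabled", dataset],
        ["zfs", "set", "compression=lz4", dataset],
        ["zfs", "set", "atime=off", dataset],
        ["zfs", "set", "recordsize=8K", dataset],
    ]


if __name__ == "__main__":
    # Point this at the directory that backs etcd's data, not at /tmp.
    probe_dir = tempfile.gettempdir()
    p99_ms = percentile(measure_fsync_latencies(probe_dir), 0.99) * 1000
    print(f"p99 fsync latency in {probe_dir}: {p99_ms:.2f} ms")
    if p99_ms > 10:  # assumed threshold, based on etcd's general guidance
        print("Storage looks too slow for etcd; candidate ZFS tuning:")
        for cmd in zfs_tuning_commands("tank/etcd"):  # hypothetical dataset
            print("  " + " ".join(cmd))
```

Run the probe several times, because the problem described above was inconsistent latency and a single good run proves little. Note that `sync=disabled` makes ZFS acknowledge synchronous writes before they reach stable storage, so a power loss or host crash can drop the most recent writes. That trade-off may be acceptable on a demo testbed but deserves more thought elsewhere.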