When etcd crashes, check your disks first

2 months ago

etcd crashes in a cloud-edge continuum testbed were caused by high storage I/O latency.
The demo involved Karmada orchestrating k3s clusters across a NUC, Raspberry Pi, and Jetson AGX Orin for real-time object detection.
Karmada pods crashed periodically due to etcd timeouts, traced back to inconsistent I/O performance on shared VM storage.
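ZFS tuning on the dataset backing etcd (disabling sync writes with sync=disabled, enabling LZ4 compression with compression=lz4, turning off access-time updates with atime=off, and setting recordsize to 8K) resolved the etcd stability issues.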
Key lesson: When etcd crashes, first investigate disk I/O performance, especially in shared or non-dedicated storage environments.
The demo successfully showcased adaptive, policy-driven orchestration that moved workloads from the Raspberry Pi to the Jetson AGX Orin based on telemetry.
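
As a rough illustration of the "check your disks first" lesson above, the following Python sketch times fsync calls on a scratch file and reports the 99th percentile. It is not the method used in the original post. The function names, sample count, and block size are assumptions chosen for illustration, and the 10 ms figure in the comments reflects commonly cited etcd guidance for WAL fsync latency rather than anything measured in this testbed.

```python
import os
import sys
import tempfile
import time


def fsync_latencies_ms(directory, samples=100, block_size=2048):
    """Write small blocks to a scratch file in `directory` and time each fsync.

    Returns the latencies in milliseconds, sorted ascending.
    `samples` and `block_size` are illustrative defaults, not etcd settings.
    """
    payload = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the write to stable storage, as a WAL append would
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return sorted(latencies)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int, 1..100)."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


if __name__ == "__main__":
    # Point this at the directory (or a sibling on the same volume) that holds etcd data.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    results = fsync_latencies_ms(target)
    p99 = percentile(results, 99)
    print(f"fsync p99 on {target}: {p99:.2f} ms")
    # Assumption: roughly 10 ms is treated here as the comfort limit for etcd.
    if p99 > 10.0:
        print("p99 exceeds ~10 ms; storage is a likely suspect for etcd timeouts")
```

Running it several times at different hours is more telling than a single run, since the problem described here was inconsistent performance on shared storage rather than uniformly slow disks.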
