When etcd crashes, check your disks first

4 days ago
  • #etcd
  • #debugging
  • #Kubernetes
  • The etcd crashes were caused by high storage I/O latency in a cloud-edge continuum testbed.
  • The demo involved Karmada orchestrating k3s clusters across a NUC, Raspberry Pi, and Jetson AGX Orin for real-time object detection.
  • Karmada pods crashed periodically due to etcd timeouts, traced back to inconsistent I/O performance on shared VM storage.
  • ZFS tuning (disabling sync writes, enabling LZ4 compression, disabling atime, setting recordsize to 8K) resolved the etcd stability issues.
  • Key lesson: When etcd crashes, first investigate disk I/O performance, especially in shared or non-dedicated storage environments.
  • The demo successfully showcased adaptive, policy-driven orchestration, moving workloads from the Raspberry Pi to the Jetson AGX Orin based on telemetry.
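
To make the "check your disks first" advice concrete, here is a minimal sketch of a write-plus-fsync latency probe in Python. It is not the method used in the original post; the function names, sample count, and payload size are assumptions chosen for illustration. It only approximates etcd's write-ahead-log pattern of small synchronous appends, so treat the numbers as a rough signal rather than a benchmark. The etcd documentation suggests keeping the 99th percentile of WAL fsync duration under roughly 10 ms.

```python
import math
import os
import tempfile
import time


def measure_fsync_latencies(directory, samples=50, payload_size=2048):
    """Append small payloads to a temp file, fsync each, return latencies in ms."""
    payload = b"\0" * payload_size
    latencies_ms = []
    # The temp file is created inside `directory` so the probe hits that disk.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write to stable storage, as a WAL would
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies_ms


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumption: the system temp directory stands in for etcd's data directory.
    results = measure_fsync_latencies(tempfile.gettempdir())
    print(f"p50 = {percentile(results, 50):.2f} ms")
    print(f"p99 = {percentile(results, 99):.2f} ms")
```

Pointing `measure_fsync_latencies` at the directory that actually holds the etcd data, and running it several times over a day, would be one way to surface the kind of inconsistent I/O on shared VM storage described above.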
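
The ZFS tuning listed above maps onto four dataset properties. The sketch below builds the corresponding `zfs set` commands without running them; the dataset name `tank/etcd` is hypothetical, and the exact commands used in the original post are not shown in this summary. Note that `sync=disabled` trades durability for latency: recently acknowledged writes can be lost on a crash or power failure, which may be acceptable on a demo testbed but deserves care elsewhere.

```python
import shlex

# Property values taken from the summary: sync writes disabled,
# LZ4 compression, atime off, 8K record size.
ETCD_ZFS_PROPERTIES = {
    "sync": "disabled",
    "compression": "lz4",
    "atime": "off",
    "recordsize": "8K",
}


def build_zfs_commands(dataset, properties=ETCD_ZFS_PROPERTIES):
    """Return one `zfs set name=value dataset` argument list per property."""
    return [
        ["zfs", "set", f"{name}={value}", dataset]
        for name, value in properties.items()
    ]


if __name__ == "__main__":
    # "tank/etcd" is a placeholder dataset name, not one from the original post.
    for command in build_zfs_commands("tank/etcd"):
        print(shlex.join(command))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; each argument list could be passed to `subprocess.run` on a host where the dataset exists.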