zml-smi: a universal monitoring tool for GPUs, TPUs, and NPUs

5 days ago
  • #gpu-diagnostics
  • #hardware-monitoring
  • #cross-platform-tools
  • zml-smi is a universal diagnostic and monitoring tool for GPUs, TPUs, and NPUs, offering real-time performance and health insights similar to nvidia-smi and nvtop.
  • It supports NVIDIA and AMD GPUs, Google TPUs, and AWS Trainium, with more platforms planned as ZML grows. Installation consists of downloading the tool from an official mirror.
  • Running 'zml-smi' lists the available devices, the '--top' flag enables real-time monitoring, and host-level metrics such as CPU information and memory usage are reported as well.
  • Process-level insights are available across platforms, showing resource usage and command lines for processes utilizing devices.
  • On NVIDIA, metrics come from NVML and include GPU utilization, temperature, power draw, VRAM usage, and PCIe statistics. AMD support uses the AMD SMI library together with a sandboxed copy of the amdgpu.ids file so that the latest GPUs are recognized.
  • TPU metrics are accessed via gRPC, providing TensorCore duty cycle and HBM data. AWS Trainium uses an embedded libnrt.so for metrics like core utilization and HBM usage.
  • The tool is fully sandboxed: the host only needs the device drivers and GLIBC. On AMD, it relies on workarounds in how files are opened so that the data files it needs can be loaded without a system-wide installation.
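
To make the NVIDIA bullet more concrete, here is a minimal Python sketch of reading the same kind of NVML metrics (utilization, temperature, power draw, VRAM). It is not zml-smi's implementation, which the summary does not show. It assumes the third-party `pynvml` bindings and an NVIDIA driver are present. The `GpuSample` type and the `format_sample` and `query_nvml` helpers are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One point-in-time reading for a single GPU (hypothetical helper type)."""
    index: int
    name: str
    util_percent: int
    temp_c: int
    power_w: float
    mem_used_mib: float
    mem_total_mib: float


def format_sample(s: GpuSample) -> str:
    """Render a sample as a single status line."""
    return (
        f"GPU {s.index} {s.name}: util {s.util_percent}% | {s.temp_c} C | "
        f"{s.power_w:.1f} W | VRAM {s.mem_used_mib:.0f}/{s.mem_total_mib:.0f} MiB"
    )


def query_nvml() -> list[GpuSample]:
    """Read one sample per GPU through NVML.

    Assumption: the `pynvml` package and an NVIDIA driver are installed.
    The import is local so the rest of the module works without them.
    """
    import pynvml

    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
            # NVML reports power in milliwatts and memory in bytes.
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append(
                GpuSample(
                    index=i,
                    name=name,
                    util_percent=util,
                    temp_c=temp,
                    power_w=power_w,
                    mem_used_mib=mem.used / 2**20,
                    mem_total_mib=mem.total / 2**20,
                )
            )
        return samples
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    for sample in query_nvml():
        print(format_sample(sample))
```

Keeping the formatting separate from the NVML query means the display logic can be exercised on machines without a GPU. Some devices do not expose every counter (power draw in particular), so a more robust version would catch `pynvml.NVMLError` per metric.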