zml-smi: a universal monitoring tool for GPUs, TPUs, and NPUs
5 days ago
- #gpu-diagnostics
- #hardware-monitoring
- #cross-platform-tools
- zml-smi is a universal diagnostic and monitoring tool for GPUs, TPUs, and NPUs, offering real-time performance and health insights similar to nvidia-smi and nvtop.
- It supports NVIDIA, AMD, Google TPU, and AWS Trainium, with more platforms planned as ZML grows. It is installed by downloading it from an official mirror.
- Running 'zml-smi' lists the available devices, '--top' enables real-time monitoring, and host-level metrics such as CPU info and memory usage are shown alongside.
- Process-level insights are available across platforms, showing resource usage and command lines for processes utilizing devices.
- For NVIDIA, metrics come from NVML and include GPU utilization, temperature, power draw, VRAM, and PCIe stats. AMD support uses the AMD SMI library together with a sandboxed amdgpu.ids file so that the latest GPUs are supported.
- TPU metrics are accessed via gRPC, providing TensorCore duty cycle and HBM data. AWS Trainium uses an embedded libnrt.so for metrics like core utilization and HBM usage.
- The tool is completely sandboxed and requires only the device drivers and GLIBC on the host. On AMD, it uses workarounds for file handling so that nothing has to be installed system-wide.
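
To make the NVIDIA bullet above concrete, here is a minimal Python sketch of reading the same kind of NVML metrics (utilization, temperature, power draw, VRAM) through the nvidia-ml-py bindings, imported as pynvml. It is an illustration only, not zml-smi's own code: the GpuSample type and the format_sample and read_nvml_samples helpers are hypothetical names introduced here, PCIe stats are left out for brevity, and the hardware query assumes an NVIDIA driver and the nvidia-ml-py package are present.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One snapshot of a device (hypothetical type, not part of zml-smi)."""
    index: int
    name: str
    util_percent: int       # GPU utilization, 0-100
    temp_c: int             # core temperature in degrees Celsius
    power_w: float          # power draw in watts
    vram_used_mib: float
    vram_total_mib: float


def format_sample(s: GpuSample) -> str:
    """Render one device as a single status line."""
    return (f"[{s.index}] {s.name}: {s.util_percent}% util, {s.temp_c}C, "
            f"{s.power_w:.1f}W, {s.vram_used_mib:.0f}/{s.vram_total_mib:.0f} MiB")


def read_nvml_samples() -> list[GpuSample]:
    """Query every NVIDIA device through NVML.

    Assumes the NVIDIA driver and the nvidia-ml-py package are installed.
    """
    import pynvml  # imported lazily so format_sample works without a GPU

    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append(GpuSample(
                index=i,
                name=name,
                util_percent=pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,
                temp_c=pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU),
                # NVML reports power in milliwatts
                power_w=pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
                vram_used_mib=mem.used / 2**20,
                vram_total_mib=mem.total / 2**20,
            ))
        return samples
    finally:
        pynvml.nvmlShutdown()
```

On a machine with an NVIDIA GPU, calling read_nvml_samples() and printing format_sample(s) for each result gives one status line per device. The formatting helper is kept separate from the hardware query so it can be exercised without a GPU.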