zml-smi: a universal monitoring tool for GPUs, TPUs, and NPUs

5 days ago

zml-smi is a universal diagnostic and monitoring tool for GPUs, TPUs, and NPUs, offering real-time performance and health insights similar to nvidia-smi and nvtop.
It supports NVIDIA, AMD, Google TPU, and AWS Trainium, with more platforms planned as ZML grows. It is installed by downloading it from an official mirror.
Features include device listing with 'zml-smi', real-time monitoring via '--top', and host-level metrics like CPU info and memory usage.
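The two invocations mentioned above can be wrapped in a small script. This is a minimal sketch, assuming the `zml-smi` binary is on `PATH`; only the bare command and the `--top` flag come from the summary, and the helper names `build_command` and `run_listing` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(top: bool = False) -> List[str]:
    """Build the argument list for zml-smi (hypothetical helper)."""
    cmd = ["zml-smi"]  # bare invocation lists devices
    if top:
        cmd.append("--top")  # real-time monitoring mode
    return cmd


def run_listing() -> Optional[str]:
    """Run the device listing once and return its raw output.

    Returns None when zml-smi is not installed. The output format is
    not described in the summary, so it is returned unparsed.
    """
    if shutil.which("zml-smi") is None:
        return None
    result = subprocess.run(
        build_command(), capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    output = run_listing()
    print(output if output is not None else "zml-smi not found on PATH")
```

The `--top` mode is interactive, so the sketch only captures output from the one-shot listing.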
Process-level insights are available across platforms, showing resource usage and command lines for processes utilizing devices.
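As an illustration of where a process's command line can come from on Linux, the sketch below reads `/proc/<pid>/cmdline`. This is an assumption about one common mechanism, not a description of how zml-smi itself collects process data; the function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_cmdline(raw: bytes) -> str:
    """Convert the NUL-separated contents of /proc/<pid>/cmdline to a string."""
    parts = [p.decode("utf-8", errors="replace") for p in raw.split(b"\0") if p]
    return " ".join(parts)


def read_cmdline(pid: int) -> Optional[str]:
    """Return the command line of a process, or None if it cannot be read."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        # Process exited, permission denied, or /proc is unavailable.
        return None
    return parse_cmdline(raw)
```

A monitoring tool would pair each such command line with the PIDs that the device driver reports as using the accelerator.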
For NVIDIA, metrics come from NVML and include GPU utilization, temperature, power draw, VRAM usage, and PCIe stats. AMD support uses the AMD SMI library, with a sandboxed copy of amdgpu.ids so that the latest GPUs are recognized.
TPU metrics are accessed via gRPC, providing TensorCore duty cycle and HBM data. AWS Trainium uses an embedded libnrt.so for metrics like core utilization and HBM usage.
The tool is fully sandboxed: it requires only the device drivers and GLIBC on the host, and it uses workarounds for AMD file handling so that nothing has to be installed system-wide.
