Virtualizing NVIDIA HGX B200 GPUs with Open Source
- #Open-Source
- #GPU Virtualization
- #NVIDIA B200
- Ubicloud now enables GPU VMs on NVIDIA's HGX B200 machines, which are trickier to virtualize than H100s.
- The HGX B200 uses SXM GPU modules and NVLink for high-bandwidth GPU-to-GPU connectivity, which makes virtualization challenging.
- Three virtualization models: Full Passthrough Mode, vGPU, and Shared NVSwitch Multitenancy Mode.
- Shared NVSwitch Multitenancy Mode supports 1-, 2-, 4-, and 8-GPU VMs with full NVLink bandwidth.
- Host preparation involves binding the GPUs to the vfio-pci driver and enabling IOMMU support (first sketch after this list).
- Matching driver versions between host and VM is critical for Shared NVSwitch Multitenancy Mode.
- A PCI topology mismatch between host and guest can cause CUDA initialization failures; QEMU can recreate the correct PCIe hierarchy inside the VM (second sketch below).
- Large-BAR stalls during VM boot can be resolved by upgrading QEMU or by disabling BAR mmap (third sketch below).
- Fabric Manager controls GPU partitions and enforces isolation in Shared NVSwitch Multitenancy Mode (final sketch below).
- The open-source implementation is available in Ubicloud, with components for GPU allocation and VM launch.
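The bullets above compress several hands-on steps; the sketches below illustrate them under stated assumptions rather than reproducing Ubicloud's actual code. First, host preparation: a minimal Python sketch (run as root) that rebinds one GPU to vfio-pci through sysfs. The PCI address is hypothetical, and the IOMMU must already be enabled on the kernel command line (e.g. `intel_iommu=on iommu=pt`).

```python
from pathlib import Path

# Hypothetical PCI address of one B200 GPU; list real ones with `lspci -d 10de:`.
PCI_ADDR = "0000:18:00.0"
DEV = Path("/sys/bus/pci/devices") / PCI_ADDR

# Unbind the device from whatever driver currently owns it, if any.
if (DEV / "driver").exists():
    (DEV / "driver" / "unbind").write_text(PCI_ADDR)

# Tell the kernel to prefer vfio-pci for this device, then ask it to reprobe.
(DEV / "driver_override").write_text("vfio-pci")
Path("/sys/bus/pci/drivers_probe").write_text(PCI_ADDR)
```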
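Second, for the PCI topology issue, a sketch (hypothetical addresses and IDs throughout) of a QEMU invocation that attaches the passed-through GPU behind an emulated pcie-root-port on a q35 machine, so the guest sees a PCIe hierarchy similar to the host's instead of a device sitting flat on the root bus:

```python
import subprocess

gpu = "0000:18:00.0"  # hypothetical host PCI address of the GPU

qemu_args = [
    "qemu-system-x86_64",
    "-machine", "q35,accel=kvm",  # q35 provides a PCIe root complex
    "-m", "64G", "-smp", "16",
    # Emulated root port; the GPU hangs off it, mirroring the host hierarchy.
    "-device", "pcie-root-port,id=rp0,bus=pcie.0,chassis=1,slot=0",
    "-device", f"vfio-pci,host={gpu},bus=rp0",
]
subprocess.run(qemu_args, check=True)
```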
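Third, for the large-BAR boot stall, one way to disable BAR mmap is QEMU's experimental `x-no-mmap` property on the vfio-pci device, which traps BAR accesses in QEMU instead of mmap()ing the large regions. It costs MMIO performance, so upgrading QEMU, as the bullet notes, is the better fix where possible:

```python
# Continuing the sketch above: avoid the large-BAR mmap path at boot.
# x-no-mmap is an experimental QEMU property; prefer a QEMU upgrade if you can.
qemu_args[-1] = f"vfio-pci,host={gpu},bus=rp0,x-no-mmap=on"
```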
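Finally, Shared NVSwitch Multitenancy Mode is selected through Fabric Manager's configuration file; the path and the `FABRIC_MODE` key below follow NVIDIA's Fabric Manager user guide (0 = full passthrough, 1 = Shared NVSwitch multitenancy, 2 = vGPU), but should be checked against your driver release. A sketch that flips the mode, leaving per-VM partition activation to Fabric Manager's partition API:

```python
from pathlib import Path

# Default config location per NVIDIA's Fabric Manager user guide.
CFG = Path("/usr/share/nvidia/nvswitch/fabricmanager.cfg")

# FABRIC_MODE=1 selects Shared NVSwitch multitenancy mode.
lines = [
    "FABRIC_MODE=1" if line.startswith("FABRIC_MODE=") else line
    for line in CFG.read_text().splitlines()
]
CFG.write_text("\n".join(lines) + "\n")

# The nvidia-fabricmanager service must be restarted for the mode to apply,
# e.g. `systemctl restart nvidia-fabricmanager`.
```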