AMD 2.0 – New Sense of Urgency
- #GPU Competition
- #Software Development
- #AI Hardware
- AMD has made rapid progress in its AI software stack over the past four months, adopting a 'Developer First' strategy and improving CI/CD integration.
- AMD's compensation for AI software engineers is significantly lower than competitors like NVIDIA, creating a talent retention challenge.
- ROCm lacks first-class Python support compared to NVIDIA's CUDA, impacting developer usability and performance optimization.
- The gap between AMD's RCCL and NVIDIA's NCCL is widening, with NCCL introducing advanced features like GPUDirect Async and user buffer registration.
- AMD's internal development clusters are insufficient for long-term competitiveness; its reliance on short-term burst capacity rather than persistent clusters hinders sustained innovation.
- AMD's MI325X and MI355X face weak customer interest, particularly compared to NVIDIA's rack-scale solutions like GB200 NVL72.
- AMD plans to launch a community developer cloud in June, aiming to replicate Google's TPU Research Cloud success.
- NVIDIA's CUDA thrives due to its massive ecosystem of external developers, while AMD struggles with slower bug fixes and feature adoption.
- AMD's software infrastructure (Kubernetes, SLURM, Docker) lags behind its ML libraries, requiring more investment.
- AMD lacks support for key inference features like disaggregated prefill and NVMe KV Cache Tiering, falling behind NVIDIA's Dynamo framework.
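To make the RCCL/NCCL bullet above concrete: both libraries implement GPU collectives such as all-reduce, classically via the ring algorithm (a reduce-scatter phase followed by an all-gather phase). Below is a minimal pure-Python sketch of ring all-reduce that simulates the ranks as lists in one process; real NCCL/RCCL run these steps concurrently across GPUs and NICs, and this sketch assumes the vector length divides evenly by the rank count.

```python
def ring_allreduce(rank_buffers):
    """Sum-all-reduce across simulated ranks using the ring algorithm:
    n-1 reduce-scatter steps, then n-1 all-gather steps.
    Illustrative sketch only: ranks are plain Python lists, and "sends"
    are sequential in-process copies rather than concurrent transfers."""
    n = len(rank_buffers)
    length = len(rank_buffers[0])
    assert length % n == 0, "sketch assumes length divisible by rank count"
    chunk = length // n

    # Reduce-scatter: in step s, rank r sends chunk (r - s) mod n to its
    # neighbor, which accumulates it. Afterwards rank r holds the fully
    # reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        for r in range(n):
            send_idx = (r - step) % n
            dst = (r + 1) % n
            lo = send_idx * chunk
            for i in range(lo, lo + chunk):
                rank_buffers[dst][i] += rank_buffers[r][i]

    # All-gather: in step s, rank r forwards the completed chunk
    # (r + 1 - s) mod n to its neighbor, which overwrites its copy.
    for step in range(n - 1):
        for r in range(n):
            send_idx = (r + 1 - step) % n
            dst = (r + 1) % n
            lo = send_idx * chunk
            rank_buffers[dst][lo:lo + chunk] = rank_buffers[r][lo:lo + chunk]
    return rank_buffers
```

Each rank sends and receives only `length / n` elements per step, which is why the ring algorithm keeps bandwidth use near-optimal regardless of rank count; features like GPUDirect Async and user buffer registration are about driving these same transfers with less CPU involvement and fewer copies.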
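The KV cache tiering mentioned in the last bullet is essentially a memory hierarchy for attention state: when GPU memory fills, evicted KV blocks are demoted to slower storage (e.g. NVMe) instead of being dropped, so a returning request can skip recomputing its prefill. A toy two-tier LRU sketch (class name and structure are my own illustration, not any real framework's API; real systems move paged KV blocks, not Python objects):

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small 'HBM' tier with LRU eviction that
    demotes entries to a larger 'NVMe' tier instead of dropping them."""

    def __init__(self, hbm_capacity):
        self.hbm_capacity = hbm_capacity
        self.hbm = OrderedDict()   # hot tier (stands in for GPU memory)
        self.nvme = {}             # cold tier (stands in for NVMe)

    def put(self, seq_id, kv_block):
        self.hbm[seq_id] = kv_block
        self.hbm.move_to_end(seq_id)           # mark most recently used
        while len(self.hbm) > self.hbm_capacity:
            victim, block = self.hbm.popitem(last=False)  # evict LRU entry
            self.nvme[victim] = block                     # demote, don't drop

    def get(self, seq_id):
        if seq_id in self.hbm:
            self.hbm.move_to_end(seq_id)
            return self.hbm[seq_id]
        if seq_id in self.nvme:
            block = self.nvme.pop(seq_id)      # hit in cold tier: promote it
            self.put(seq_id, block)            # back to HBM, prefill skipped
            return block
        return None                            # true miss: prefill recomputes
```

The design point is that a cold-tier hit costs an NVMe read instead of a full prefill recompute, which is why frameworks like Dynamo invest in this plumbing; lacking it, an engine must recompute KV state for every request that ages out of GPU memory.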