The state of cloud GPUs in 2025: costs, performance, playbooks
21 hours ago
- #AI-infrastructure
- #GPU
- #cloud-computing
- The article provides a practical guide for teams renting GPUs, covering costs, performance, and strategies for multi-cloud environments.
- Providers are segmented by target scale and automation maturity into categories such as classical hyperscalers, massive neoclouds, and cloud marketplaces.
- NVIDIA remains dominant due to CUDA and tooling maturity, but AMD's ROCm and MI series are becoming viable alternatives with competitive memory and bandwidth.
- Key factors affecting GPU performance include memory, fabric bandwidth, topology, local NVMe, network volumes, and orchestration.
- Pricing models vary, with commitments offering discounts but carrying utilization risks, while on-demand and spot options provide flexibility.
- Quotas and approvals can restrict access to GPUs, making multi-cloud strategies essential for some teams.
- New GPU generations focus on memory and bandwidth scaling, improved fabrics, and cost-effective ways to split prefill and decode.
- Control planes are crucial for maximizing utilization, enforcing portability, and managing multi-cloud environments efficiently.
- Final takeaways emphasize distinguishing price from cost, matching commitments to workloads, adopting multi-cloud strategies, and leveraging control planes.
- The report acknowledges limitations in provider coverage and methodology, with plans for future updates on price normalization and benchmarks.
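
The points above on commitments, utilization risk, and price vs. cost can be made concrete with a little arithmetic. The following is a minimal sketch, not the report's methodology: the function names and all prices are hypothetical, and it assumes on-demand capacity is billed only for hours actually used while a commitment is billed for every hour regardless of use.

```python
def effective_hourly_cost(hourly_price: float, utilization: float) -> float:
    """Cost per useful GPU-hour when only a fraction of paid hours do work.

    hourly_price: what is paid per reserved GPU-hour (hypothetical figure).
    utilization: fraction of paid hours doing useful work, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


def break_even_utilization(committed_price: float, on_demand_price: float) -> float:
    """Utilization above which a commitment is cheaper than on-demand.

    Assumes on-demand is paid only for hours used (an illustrative
    simplification), so the commitment wins once
    committed_price / utilization < on_demand_price.
    """
    if committed_price <= 0 or on_demand_price <= 0:
        raise ValueError("prices must be positive")
    return committed_price / on_demand_price


if __name__ == "__main__":
    # Hypothetical prices in USD per GPU-hour, chosen only for illustration.
    committed, on_demand = 2.0, 4.0
    print(break_even_utilization(committed, on_demand))  # 0.5
    print(effective_hourly_cost(committed, 0.25))        # 8.0
```

With these made-up numbers, a commitment at half the on-demand price only pays off if the GPUs are busy more than half the time. At 25% utilization, the effective cost per useful hour is double the on-demand price, which is the utilization risk the pricing bullet refers to.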