15 years of FP64 segmentation, and why the Blackwell Ultra breaks the pattern
6 days ago
- #FP64
- #AI
- #GPU
- The RTX 5090 offers 104.8 TFLOPS of FP32 compute but only 1.64 TFLOPS of FP64, highlighting a 64:1 performance gap.
- Nvidia has progressively widened the FP64:FP32 ratio on consumer GPUs from 1:8 in 2010 (Fermi) to 1:64 in 2020 (Ampere), while enterprise GPUs maintained a 1:2 or 1:3 ratio.
- FP64 performance on consumer GPUs increased only 9.65x from 2010 to 2025, compared to a 77.63x increase in FP32 performance.
- Market segmentation is the primary reason for limiting FP64 on consumer GPUs, as most consumer workloads don't require double-precision math.
- Enterprise GPUs justified higher prices with strong FP64 performance, ECC memory, NVLink, and support contracts.
- AI training relies mostly on lower precisions (FP16, BF16, FP8, FP4), which made consumer GPUs viable for serious compute workloads; in response, Nvidia's 2017 GeForce EULA banned their deployment in datacenters.
- FP64 emulation techniques let consumer GPUs approach double-precision accuracy anyway: Dekker's double-float arithmetic represents each high-precision value as a pair of FP32 numbers, while the Ozaki scheme decomposes matrix multiplications into lower-precision pieces that run on tensor cores.
- Nvidia's latest enterprise GPUs (B300) reduce FP64 performance in favor of low-precision tensor cores, aligning with AI demands.
- The next market segmentation may shift from FP64 to low-precision floating-point ratios (e.g., FP16:FP32).