15 years of FP64 segmentation, and why the Blackwell Ultra breaks the pattern
6 days ago
- #FP64
- #AI
- #GPU
- The RTX 5090 offers 104.8 TFLOPS of FP32 compute but only 1.64 TFLOPS of FP64, highlighting a 64:1 performance gap.
- Nvidia has progressively widened the FP64:FP32 ratio on consumer GPUs from 1:8 in 2010 (Fermi) to 1:64 in 2020 (Ampere), while enterprise GPUs maintained a 1:2 or 1:3 ratio.
- FP64 performance on consumer GPUs increased only 9.65x from 2010 to 2025, compared to a 77.63x increase in FP32 performance.
- Market segmentation is the primary reason for limiting FP64 on consumer GPUs, as most consumer workloads don't require double-precision math.
- Enterprise GPUs justified higher prices with strong FP64 performance, ECC memory, NVLink, and support contracts.
- AI training relies mostly on lower precisions (FP16, BF16, FP8, FP4), which made consumer GPUs viable for serious compute workloads; in response, Nvidia's 2017 GeForce EULA banned their deployment in datacenters.
- FP64 emulation techniques let consumer GPUs approach double-precision accuracy anyway: Dekker's double-float arithmetic represents each high-precision value as a pair of FP32 numbers, while the Ozaki scheme decomposes matrix multiplications into lower-precision pieces that run on tensor cores.
- Nvidia's latest enterprise GPUs (B300) reduce FP64 performance in favor of low-precision tensor cores, aligning with AI demands.
- The next market segmentation may shift from FP64 to low-precision floating-point ratios (e.g., FP16:FP32).