Hasty Briefs (beta)


15 years of FP64 segmentation, and why the Blackwell Ultra breaks the pattern

6 days ago
  • #FP64
  • #AI
  • #GPU
  • The RTX 5090 offers 104.8 TFLOPS of FP32 compute but only 1.64 TFLOPS of FP64, highlighting a 64:1 performance gap.
  • Nvidia has progressively cut consumer FP64 throughput relative to FP32, from a 1:8 ratio in 2010 (Fermi) to 1:64 in 2020 (Ampere), while enterprise GPUs maintained a 1:2 or 1:3 ratio.
  • FP64 performance on consumer GPUs increased only 9.65x from 2010 to 2025, compared to a 77.63x increase in FP32 performance.
  • Market segmentation is the primary reason for limiting FP64 on consumer GPUs, as most consumer workloads don't require double-precision math.
  • Enterprise GPUs justified higher prices with strong FP64 performance, ECC memory, NVLink, and support contracts.
  • AI training relies mostly on lower precisions (FP16, BF16, FP8, FP4), which made consumer GPUs viable for serious compute workloads; Nvidia responded in 2017 by restricting datacenter deployment of consumer cards via its driver EULA.
  • FP64 emulation techniques, like Dekker's double-float arithmetic and the Ozaki scheme, allow consumer GPUs to perform high-precision calculations using FP32 or lower-precision tensor cores.
  • Nvidia's latest enterprise GPUs (B300) reduce FP64 performance in favor of low-precision tensor cores, aligning with AI demands.
  • The next market segmentation may shift from FP64 to low-precision floating-point ratios (e.g., FP16:FP32).
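To make the emulation bullet concrete, here is a minimal sketch of the Dekker-style double-float building blocks (error-free addition and multiplication), using NumPy `float32` to stand in for the GPU's native single precision. The function names `two_sum`, `split`, and `two_prod` are conventional labels from the error-free-transformation literature, not anything from the summarized article; the Ozaki scheme (tensor-core-based) is not shown.

```python
import numpy as np

F = np.float32  # stand-in for the GPU's native single precision

def two_sum(a, b):
    # Knuth's error-free addition: as real numbers, s + e == a + b exactly.
    s = F(a + b)
    t = F(s - a)
    e = F(F(a - F(s - t)) + F(b - t))
    return s, e

def split(a):
    # Veltkamp split: for float32 (24-bit mantissa) the constant is 2^12 + 1.
    c = F(F(4097) * a)
    hi = F(c - F(c - a))
    lo = F(a - hi)
    return hi, lo

def two_prod(a, b):
    # Dekker's error-free multiplication: as real numbers, p + e == a * b
    # exactly (barring overflow/underflow), because each half of the split
    # has at most 12 significant bits, so the partial products are exact.
    p = F(a * b)
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = F(F(F(F(a_hi * b_hi) - p) + F(a_hi * b_lo)) + F(a_lo * b_hi))
    e = F(e + F(a_lo * b_lo))
    return p, e

# The (value, error) pairs carry roughly double the mantissa width of a
# single float32, which is the basis for emulating FP64 on FP32 hardware.
a, b = F(1) / F(3), F(1) / F(7)
s, se = two_sum(a, b)
p, pe = two_prod(a, b)
```

Checking the pairs in float64 confirms the transformations are exact: `np.float64(s) + np.float64(se)` equals `np.float64(a) + np.float64(b)`, and likewise for the product. Chaining these primitives into full add/mul/div routines on float pairs is what double-float libraries do in practice.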