FP8 Is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail

5 hours ago

Argues that native hardware FP64 is not essential for scientific computing on AI-optimized GPUs like NVIDIA's Blackwell Ultra (B300).
Proposes pairing FP8 tensor-core throughput with Ozaki Scheme II, which is based on the Chinese Remainder Theorem, to recover full FP64 accuracy in HPC kernels.
Introduces the Tensor-Memory Equilibrium (TME) model, which extends the Roofline model with compute and bandwidth multipliers and a reconstruction-latency term.
Claims that FP64 emulated with Ozaki II can reach ~500 TFLOPS on B300, more than an order of magnitude above the chip's native FP64 performance.
Concludes that FP8, combined with Ozaki II and Kulisch fixed-point reconstruction, makes native FP64 silicon unnecessary for production HPC.
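
The Chinese Remainder Theorem idea behind the summary above can be illustrated with a minimal sketch. It is not the article's implementation and not Ozaki Scheme II itself: it works on plain Python integer matrices, skips the floating-point scaling and FP8 tensor-core steps entirely, and the moduli and the names matmul_mod, crt, and exact_matmul_via_crt are assumptions chosen for illustration. It shows only that several matrix products over small residues can be recombined into one exact wide result.

```python
from math import prod

# Pairwise coprime moduli, all at most 256 so each residue fits in 8 bits.
# These values are an assumption for illustration; the summary does not
# state which moduli the article uses.
MODULI = [251, 253, 255, 256]


def matmul_mod(a, b, m):
    """Matrix product with inputs and outputs reduced modulo m."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum((a[i][t] % m) * (b[t][j] % m) for t in range(inner)) % m
         for j in range(cols)]
        for i in range(rows)
    ]


def crt(residues, moduli):
    """Return the unique x in [0, prod(moduli)) with x % m == r for each pair."""
    big = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = big // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    return x % big


def exact_matmul_via_crt(a, b, moduli=MODULI):
    """Exact integer matrix product rebuilt from small-modulus products.

    Correct only while every true entry lies within +/- prod(moduli) / 2.
    """
    big = prod(moduli)
    parts = [matmul_mod(a, b, m) for m in moduli]
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            x = crt([part[i][j] for part in parts], moduli)
            # Map from [0, big) to the symmetric range to recover negatives.
            out[i][j] = x - big if x > big // 2 else x
    return out


if __name__ == "__main__":
    a = [[12345, -678], [9, 1000]]
    b = [[-4321, 77], [5, -31000]]
    print(exact_matmul_via_crt(a, b))
```

Each call to matmul_mod touches only small residues, which is the part a low-precision matrix unit could plausibly take over; the reconstruction in crt is the step that needs wide arithmetic. Adding more coprime moduli widens the range of results that can be recovered exactly.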
