FP8 Is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail
5 hours ago
- #AI-Optimized GPUs
- #HPC Hardware
- #FP8 Computing
- Argues that native hardware FP64 is not essential for scientific computing on AI-optimized GPUs like NVIDIA's Blackwell Ultra (B300).
- Proposes using FP8 tensor throughput with Ozaki Scheme II, which is based on the Chinese Remainder Theorem (CRT), to achieve full FP64 accuracy in HPC kernels.
- Introduces the Tensor-Memory Equilibrium (TME) model, which extends the Roofline model with compute and bandwidth multipliers and a reconstruction-latency term.
- Claims that FP64 emulated with Ozaki II can reach ~500 TFLOPS on the B300, more than an order of magnitude above native FP64 performance.
- Concludes that FP8, combined with Ozaki II and Kulisch fixed-point reconstruction, makes native FP64 silicon unnecessary for production HPC.
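
To make the CRT idea behind the summary concrete, here is a minimal Python sketch of the integer core only: an exact matrix product is rebuilt from several small-modulus products, each of which has operands that fit in 8 bits. The function name `crt_matmul` and the choice of moduli are illustrative assumptions, not taken from the article, and the sketch leaves out the floating-point scaling, the FP8 tensor-core execution, and the Kulisch-style accumulation that a full FP64 emulation would need.

```python
from math import prod
import numpy as np

# Pairwise coprime moduli below 256 (illustrative choice, not from the article).
MODULI = [251, 241, 239, 233, 229, 227]

def crt_matmul(A, B, moduli=MODULI):
    """Exact integer matrix product rebuilt from residue products via CRT.

    Correct as long as every entry of A @ B lies in (-M/2, M/2],
    where M is the product of the moduli.
    """
    M = prod(moduli)
    # Object dtype keeps Python big integers, so the recombination cannot overflow.
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=object)
    for p in moduli:
        # Residues lie in [0, p), so each of these is a small-operand GEMM.
        Ap = (A % p).astype(np.int64)
        Bp = (B % p).astype(np.int64)
        Cp = (Ap @ Bp) % p
        # Standard CRT weight: (M/p) * inverse of (M/p) modulo p.
        Mp = M // p
        weight = Mp * pow(Mp, -1, p)
        acc = (acc + Cp.astype(object) * weight) % M
    # Map [0, M) to the symmetric range to recover negative entries.
    return np.where(acc > M // 2, acc - M, acc)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(-1000, 1000, size=(4, 5))
    B = rng.integers(-1000, 1000, size=(5, 3))
    print(np.array_equal(crt_matmul(A, B).astype(np.int64), A @ B))
```

Each pass through the loop is independent of the others, which is what lets the residue products be dispatched as separate low-precision GEMMs. Only the final recombination needs wide arithmetic.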