Hasty Briefs (beta)

Floating Point from Scratch

3 days ago
  • #verification
  • #hardware-design
  • #floating-point
  • Floating point representation involves sign, exponent, and mantissa fields, with formats like IEEE 754 defining behavior for operations, including special cases.
  • Special floating point values include +0/-0, NaN (quiet and signaling), and ±∞, each with specific rules for arithmetic and comparisons.
  • Denormal (subnormal) numbers allow gradual underflow, preventing abrupt loss of precision near zero, but add implementation complexity.
  • Rounding modes (e.g., RD, RU, RZ, RN) determine how unrepresentable results are handled, affecting overflow and underflow behavior.
  • Comparisons with NaN result in 'unordered' outcomes, breaking traditional laws like trichotomy; in particular, x != x evaluates to true when x is NaN.
  • Hardware implementation of floating point arithmetic optimizes for area and performance, using architectures like dual-path adders for high-speed operations.
  • The bfloat16 format, with 8-bit exponent and 7-bit mantissa, is chosen for AI workloads due to its range and simpler hardware, though it lacks a strict spec.
  • Verification of floating point hardware requires exhaustive testing, often using formal methods or simulation across all input combinations to cover corner cases.
  • The C++ standard library's bfloat16 type (std::bfloat16_t) computes via float32 internally, so results can differ (by up to 1 ulp) from custom hardware that operates directly at precision p=8 (7 stored mantissa bits plus the implicit leading bit).
  • Tapeouts on IHP 130nm nodes demonstrate practical implementations, with optimizations like custom multipliers and LZC designs achieving high frequencies (e.g., 454.545 MHz).
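To make the sign/exponent/mantissa split concrete, here is a minimal Python sketch that decodes an IEEE 754 binary32 value into its three fields (the helper name is my own, not from the post):

```python
import struct

def decode_float32(x: float) -> tuple[int, int, int]:
    """Split an IEEE 754 binary32 value into sign, exponent, and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8-bit biased exponent (bias 127)
    mantissa = bits & 0x7FFFFF      # 23-bit fraction (implicit leading 1 for normals)
    return sign, exponent, mantissa

# 1.0 is stored as sign=0, biased exponent 127, fraction 0.
print(decode_float32(1.0))   # (0, 127, 0)
print(decode_float32(-2.0))  # (1, 128, 0)
```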
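The special-value rules summarized above (signed zeros, ±∞, and unordered NaN comparisons) can be observed directly on any IEEE 754 host:

```python
import math

inf = math.inf
nan = math.nan

# Signed zeros compare equal, but the sign bit is still observable.
print(0.0 == -0.0)               # True
print(math.copysign(1.0, -0.0))  # -1.0

# ±∞ arithmetic follows fixed rules; ∞ - ∞ has no meaningful value, so it yields NaN.
print(inf + 1.0)              # inf
print(math.isnan(inf - inf))  # True

# NaN is unordered with everything, including itself: the classic x != x test.
print(nan == nan)                        # False
print(nan != nan)                        # True
print(nan < 1.0, nan > 1.0, nan == 1.0)  # False False False
```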
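Gradual underflow is also easy to demonstrate from a host binary64: halving the smallest normal number yields a subnormal rather than jumping straight to zero.

```python
import math
import sys

tiny = math.ulp(0.0)               # smallest positive subnormal (2**-1074 for binary64)
smallest_normal = sys.float_info.min  # smallest positive normal (2**-1022)

# Below the smallest normal, results degrade gradually through subnormals...
print(smallest_normal * 0.5)        # a subnormal, still nonzero
print(smallest_normal * 0.5 > 0.0)  # True
# ...until even the smallest subnormal finally rounds to zero.
print(tiny / 2)                     # 0.0
```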
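The rounding modes listed above all reduce to the same question: what to do with the bits shifted out of a result. A simplified sketch over non-negative integer significands (sign handling, which distinguishes RD from RZ, is left out; the function name is hypothetical):

```python
def round_shift(value: int, shift: int, mode: str) -> int:
    """Round value / 2**shift to an integer under rounding mode
    RZ (toward zero), RU (up), or RN (nearest, ties to even).
    Assumes value >= 0; hardware handles the sign separately."""
    truncated = value >> shift
    discarded = value & ((1 << shift) - 1)
    if discarded == 0 or mode == "RZ":
        return truncated
    if mode == "RU":
        return truncated + 1
    if mode == "RN":
        half = 1 << (shift - 1)
        if discarded > half or (discarded == half and truncated & 1):
            return truncated + 1  # above halfway, or tie with odd LSB
        return truncated
    raise ValueError(mode)

# 11/4 = 2.75: RZ→2, RU→3, RN→3.  10/4 = 2.5: RN ties to even → 2.
print(round_shift(0b1011, 2, "RZ"), round_shift(0b1011, 2, "RU"), round_shift(0b1011, 2, "RN"))
print(round_shift(0b1010, 2, "RN"))
```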
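On the hardware side, the leading-zero counter (LZC) mentioned above is what renormalizes a significand after massive cancellation in an adder's near path. A behavioral model of the counting step (the real designs are priority-encoder trees, not loops):

```python
def lzc(x: int, width: int) -> int:
    """Count leading zeros of a width-bit significand, i.e. the left-shift
    amount needed to renormalize after cancellation."""
    for i in range(width - 1, -1, -1):
        if (x >> i) & 1:
            return width - 1 - i
    return width  # all-zero input: no leading one at all

print(lzc(0b000101, 6))   # 3: shift left by 3 to restore the leading 1
print(lzc(0x800000, 24))  # 0: already normalized
```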
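Because bfloat16 shares binary32's sign and 8-bit exponent, conversion is just dropping (or rounding away) the low 16 mantissa bits. A bit-level sketch with round-to-nearest-even — a common hardware choice, not necessarily what any particular implementation does, and distinct from the float32-internal path the post attributes to std::bfloat16_t:

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Narrow a float to its 16-bit bfloat16 pattern, rounding the
    discarded 16 mantissa bits to nearest, ties to even."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # nearest-even bias on the low 16 bits
    return bits >> 16

print(hex(float32_to_bfloat16_bits(1.0)))  # 0x3f80
```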
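The exhaustive-testing point is what makes 16-bit formats attractive to verify: a unary bfloat16 operation has only 2^16 input patterns, so full input coverage is a trivial loop. A sketch that exhaustively checks a widen-then-narrow round trip (helper names are my own; NaNs are skipped because the host FPU may quiet their payloads in transit):

```python
import struct

def bf16_to_f32(bits16: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by appending 16 zero bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

def f32_to_bf16(x: float) -> int:
    """Narrow float32 to bfloat16 by truncating the low 16 mantissa bits."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def roundtrip_mismatches() -> int:
    """Exhaustively test all 2**16 bfloat16 patterns against the identity."""
    bad = 0
    for bits in range(1 << 16):
        if (bits >> 7) & 0xFF == 0xFF and bits & 0x7F:
            continue  # skip NaN patterns (payload preservation is host-dependent)
        if f32_to_bf16(bf16_to_f32(bits)) != bits:
            bad += 1
    return bad

print(roundtrip_mismatches())  # 0
```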