Floating Point from Scratch
- #verification
- #hardware-design
- #floating-point
- A floating point value is encoded in sign, exponent, and mantissa (significand) fields; standards like IEEE 754 define both the encoding and the behavior of operations, including special cases.
- Special floating point values include +0/-0, NaN (quiet and signaling), and ±∞, each with specific rules for arithmetic and comparisons.
- Denormal (subnormal) numbers allow gradual underflow, preventing abrupt loss of precision near zero, but add implementation complexity.
- Rounding modes — RD (toward −∞), RU (toward +∞), RZ (toward zero), and RN (to nearest, ties to even) — determine how unrepresentable results are handled, affecting overflow and underflow behavior.
- Comparisons with NaN result in 'unordered' outcomes, breaking traditional laws like trichotomy: x != x evaluates to true when x is NaN.
- Hardware implementations of floating point arithmetic trade off area against performance, using architectures like dual-path adders for low-latency addition.
- The bfloat16 format, with an 8-bit exponent and a 7-bit stored mantissa (precision p=8 counting the implicit leading bit), is chosen for AI workloads due to its float32-like range and simpler hardware, though it lacks a strict spec.
- Verification of floating point hardware requires exhaustive testing, often using formal methods or simulation across all input combinations to cover corner cases.
- The C++ standard library's bfloat16 type performs computation in float32 internally, which can produce results differing by up to 1 ulp from custom hardware that computes directly at precision p=8.
- Tapeouts on the IHP 130nm node demonstrate practical implementations; optimizations such as custom multipliers and leading-zero-counter (LZC) designs achieve high clock frequencies (e.g., 454.545 MHz).
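The sign/exponent/mantissa field layout from the first bullet can be made concrete by decoding a float32 bit pattern. A minimal sketch (`decode_float32` is an illustrative helper name, not from the notes):

```python
import struct

def decode_float32(x: float) -> tuple[int, int, int]:
    """Split an IEEE 754 binary32 value into its sign, exponent, and
    mantissa bit fields (1 + 8 + 23 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # biased by 127
    mantissa = bits & 0x7FFFFF      # fraction, without the implicit leading 1
    return sign, exponent, mantissa

# 1.0 is stored as sign=0, biased exponent 127, fraction 0.
print(decode_float32(1.0))   # (0, 127, 0)
# -2.5 = -1.25 * 2^1: sign=1, biased exponent 128, fraction 0.25.
print(decode_float32(-2.5))  # (1, 128, 2097152)
```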
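The special-value and unordered-comparison rules (±0, ±∞, NaN) are directly observable from Python, which follows IEEE 754 semantics here:

```python
import math

nan = float("nan")
inf = float("inf")

# NaN is unordered: every comparison against it is false, so x != x holds.
print(nan == nan)             # False
print(nan != nan)             # True
print(nan < 0.0, nan > 0.0)   # False False

# +0.0 and -0.0 compare equal but carry different signs,
# observable via copysign or by dividing through infinity.
print(0.0 == -0.0)                # True
print(math.copysign(1.0, -0.0))   # -1.0
print(1.0 / inf, 1.0 / -inf)      # 0.0 -0.0
```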
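Gradual underflow through subnormals can also be demonstrated directly: halving the smallest normal binary64 value does not snap to zero, and only the smallest subnormal finally underflows.

```python
import sys

# Smallest normal binary64 value: 2**-1022.
min_normal = sys.float_info.min
# Smallest subnormal: 2**-1074 (exponent field 0, mantissa 1).
min_subnormal = 5e-324

print(min_normal == 2.0 ** -1022)     # True
print(min_subnormal == 2.0 ** -1074)  # True

# Gradual underflow: results below the normal range land in the
# subnormal range with reduced precision instead of flushing to zero.
print(min_normal / 2 > 0.0)           # True
print(min_subnormal / 2 == 0.0)       # True (ties-to-even rounds to zero)
```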
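The four rounding modes can be sketched on exact rationals by rounding a positive value to a p-bit significand. This is an illustrative model only (`round_to_precision` is a hypothetical helper): it ignores exponent range, so it models neither overflow nor underflow.

```python
from fractions import Fraction

def round_to_precision(x: Fraction, p: int, mode: str) -> Fraction:
    """Round positive x to a p-bit significand under RD/RU/RZ/RN.
    For x > 0, RD coincides with RZ and RU rounds away from zero."""
    assert x > 0
    # Scale so the significand lies in [2**(p-1), 2**p).
    e = 0
    while x >= 2 ** p:
        x /= 2
        e += 1
    while x < 2 ** (p - 1):
        x *= 2
        e -= 1
    lo = Fraction(int(x))           # truncated significand
    if lo == x:                     # exactly representable: no rounding
        return lo * Fraction(2) ** e
    hi = lo + 1
    if mode in ("RZ", "RD"):        # toward zero / toward -inf (same for x > 0)
        sig = lo
    elif mode == "RU":              # toward +inf
        sig = hi
    else:                           # RN: nearest, ties to even
        if x - lo < hi - x:
            sig = lo
        elif x - lo > hi - x:
            sig = hi
        else:
            sig = lo if int(lo) % 2 == 0 else hi
    return sig * Fraction(2) ** e

# 1 + 2**-4 with p=4 is an exact halfway case between 1 and 1.125.
x = Fraction(17, 16)
print(round_to_precision(x, 4, "RZ"))  # 1
print(round_to_precision(x, 4, "RU"))  # 9/8
print(round_to_precision(x, 4, "RN"))  # 1 (tie broken toward even)
```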
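Because bfloat16 is float32 with the low 16 mantissa bits dropped, conversion reduces to rounding at that bit boundary. A common round-to-nearest-even bit trick, sketched here under the assumption of simplified NaN handling (not necessarily what any particular hardware or library does):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Convert a float to its 16-bit bfloat16 encoding by rounding the
    float32 bit pattern to nearest, ties to even (NaN handling simplified)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: truncate and force a quiet-NaN payload bit.
        return (bits >> 16) | 0x0040
    # Adding this bias then shifting implements round-to-nearest-even
    # on the 16 bits being discarded.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return (bits + rounding_bias) >> 16

print(hex(f32_to_bf16_bits(1.0)))    # 0x3f80
print(hex(f32_to_bf16_bits(-2.5)))   # 0xc020
print(hex(f32_to_bf16_bits(float("inf"))))  # 0x7f80
```

The rounding bias is the classic trick behind many software bfloat16 converters: it turns a truncating shift into round-to-nearest-even without any branching, which is also why the hardware datapath for this conversion is cheap.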