Hasty Briefs (beta)

Floating Point from Scratch

3 days ago
  • #verification
  • #hardware-design
  • #floating-point
  • Floating point representation involves sign, exponent, and mantissa fields, with formats like IEEE 754 defining behavior for operations, including special cases.
  • Special floating point values include +0/-0, NaN (quiet and signaling), and ±∞, each with specific rules for arithmetic and comparisons.
  • Denormal (subnormal) numbers allow gradual underflow, preventing abrupt loss of precision near zero, but add implementation complexity.
  • Rounding modes (e.g., RD, RU, RZ, RN) determine how unrepresentable results are handled, affecting overflow and underflow behavior.
  • Comparisons with NaN result in 'unordered' outcomes, breaking traditional laws like trichotomy; in particular, x != x evaluates to true when x is NaN.
  • Hardware implementation of floating point arithmetic optimizes for area and performance, using architectures like dual-path adders for high-speed operations.
  • The bfloat16 format, with 8-bit exponent and 7-bit mantissa, is chosen for AI workloads due to its range and simpler hardware, though it lacks a strict spec.
  • Verification of floating point hardware requires exhaustive testing, often using formal methods or simulation across all input combinations to cover corner cases.
  • The C++ standard library's bfloat16 type (std::bfloat16_t) computes via float32 internally, so results can differ (by up to 1 ulp) from custom hardware that operates directly at precision p=8 (7 stored mantissa bits plus the implicit leading bit).
  • Tapeouts on IHP 130nm nodes demonstrate practical implementations, with optimizations like custom multipliers and LZC designs achieving high frequencies (e.g., 454.545 MHz).
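To make the sign/exponent/mantissa split concrete, here is a minimal Python sketch that decodes an IEEE 754 binary32 value into its three fields (the helper name is my own, not from the post):

```python
import struct

def decode_float32(x: float) -> tuple[int, int, int]:
    """Split an IEEE 754 binary32 value into sign, exponent, and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8-bit biased exponent (bias 127)
    mantissa = bits & 0x7FFFFF      # 23-bit fraction (implicit leading 1 for normals)
    return sign, exponent, mantissa

# 1.0 is stored as sign=0, biased exponent 127, fraction 0.
print(decode_float32(1.0))   # (0, 127, 0)
print(decode_float32(-2.0))  # (1, 128, 0)
```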
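The special-value rules summarized above (signed zeros, ±∞, and unordered NaN comparisons) can be observed directly on any IEEE 754 host:

```python
import math

inf = math.inf
nan = math.nan

# Signed zeros compare equal, but the sign bit is still observable.
print(0.0 == -0.0)               # True
print(math.copysign(1.0, -0.0))  # -1.0

# ±∞ arithmetic follows fixed rules; ∞ - ∞ has no meaningful value, so it yields NaN.
print(inf + 1.0)              # inf
print(math.isnan(inf - inf))  # True

# NaN is unordered with everything, including itself: the classic x != x test.
print(nan == nan)                        # False
print(nan != nan)                        # True
print(nan < 1.0, nan > 1.0, nan == 1.0)  # False False False
```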
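Gradual underflow is also easy to demonstrate from a host binary64: halving the smallest normal number yields a subnormal rather than jumping straight to zero.

```python
import math
import sys

tiny = math.ulp(0.0)               # smallest positive subnormal (2**-1074 for binary64)
smallest_normal = sys.float_info.min  # smallest positive normal (2**-1022)

# Below the smallest normal, results degrade gradually through subnormals...
print(smallest_normal * 0.5)        # a subnormal, still nonzero
print(smallest_normal * 0.5 > 0.0)  # True
# ...until even the smallest subnormal finally rounds to zero.
print(tiny / 2)                     # 0.0
```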
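The rounding modes listed above all reduce to the same question: what to do with the bits shifted out of a result. A simplified sketch over non-negative integer significands (sign handling, which distinguishes RD from RZ, is left out; the function name is hypothetical):

```python
def round_shift(value: int, shift: int, mode: str) -> int:
    """Round value / 2**shift to an integer under rounding mode
    RZ (toward zero), RU (up), or RN (nearest, ties to even).
    Assumes value >= 0; hardware handles the sign separately."""
    truncated = value >> shift
    discarded = value & ((1 << shift) - 1)
    if discarded == 0 or mode == "RZ":
        return truncated
    if mode == "RU":
        return truncated + 1
    if mode == "RN":
        half = 1 << (shift - 1)
        if discarded > half or (discarded == half and truncated & 1):
            return truncated + 1  # above halfway, or tie with odd LSB
        return truncated
    raise ValueError(mode)

# 11/4 = 2.75: RZ→2, RU→3, RN→3.  10/4 = 2.5: RN ties to even → 2.
print(round_shift(0b1011, 2, "RZ"), round_shift(0b1011, 2, "RU"), round_shift(0b1011, 2, "RN"))
print(round_shift(0b1010, 2, "RN"))
```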
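On the hardware side, the leading-zero counter (LZC) mentioned above is what renormalizes a significand after massive cancellation in an adder's near path. A behavioral model of the counting step (the real designs are priority-encoder trees, not loops):

```python
def lzc(x: int, width: int) -> int:
    """Count leading zeros of a width-bit significand, i.e. the left-shift
    amount needed to renormalize after cancellation."""
    for i in range(width - 1, -1, -1):
        if (x >> i) & 1:
            return width - 1 - i
    return width  # all-zero input: no leading one at all

print(lzc(0b000101, 6))   # 3: shift left by 3 to restore the leading 1
print(lzc(0x800000, 24))  # 0: already normalized
```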
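Because bfloat16 shares binary32's sign and 8-bit exponent, conversion is just dropping (or rounding away) the low 16 mantissa bits. A bit-level sketch with round-to-nearest-even — a common hardware choice, not necessarily what any particular implementation does, and distinct from the float32-internal path the post attributes to std::bfloat16_t:

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Narrow a float to its 16-bit bfloat16 pattern, rounding the
    discarded 16 mantissa bits to nearest, ties to even."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # nearest-even bias on the low 16 bits
    return bits >> 16

print(hex(float32_to_bfloat16_bits(1.0)))  # 0x3f80
```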
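The exhaustive-testing point is what makes 16-bit formats attractive to verify: a unary bfloat16 operation has only 2^16 input patterns, so full input coverage is a trivial loop. A sketch that exhaustively checks a widen-then-narrow round trip (helper names are my own; NaNs are skipped because the host FPU may quiet their payloads in transit):

```python
import struct

def bf16_to_f32(bits16: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by appending 16 zero bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

def f32_to_bf16(x: float) -> int:
    """Narrow float32 to bfloat16 by truncating the low 16 mantissa bits."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def roundtrip_mismatches() -> int:
    """Exhaustively test all 2**16 bfloat16 patterns against the identity."""
    bad = 0
    for bits in range(1 << 16):
        if (bits >> 7) & 0xFF == 0xFF and bits & 0x7F:
            continue  # skip NaN patterns (payload preservation is host-dependent)
        if f32_to_bf16(bf16_to_f32(bits)) != bits:
            bad += 1
    return bad

print(roundtrip_mismatches())  # 0
```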