Exposing Floating Point – Bartosz Ciechanowski

  • #IEEE-754
  • #floating-point
  • #numerical-computing
  • Floating-point numbers follow the IEEE 754 binary formats: the common half, float, and double types correspond to binary16, binary32, and binary64 respectively.
  • A floating-point number is essentially base-2 scientific notation with a limited number of significand digits and a limited exponent range, which causes rounding errors for values such as 0.2 and makes extremely large or small magnitudes unrepresentable (see the rounding sketch after this list).
  • A 32-bit float is encoded with 1 sign bit (0 for positive, 1 for negative), 8 exponent bits (biased by 127), and 23 significand bits (with an implicit leading 1, except for subnormals); the bit-field sketch below decodes these fields.
  • Special values include positive/negative zero (exponent 0, significand 0), infinities (exponent all 1s, significand 0), and NaNs (exponent all 1s, significand non-zero), each with a specific role in arithmetic: the sign of zero survives underflow, infinities absorb overflow, and NaNs mark invalid operations (special-values sketch below).
  • Subnormals represent numbers smaller than the smallest normal value by switching to an implicit leading 0 in the significand, at the cost of reduced precision; this gradual underflow guarantees that x - y = 0 if and only if x = y (gradual-underflow sketch below).
  • The representable values are distributed non-uniformly: the spacing between neighboring numbers doubles with each increment of the exponent, and every integer up to 2^24 can be represented exactly in a float (spacing sketch below).
  • Converting to a smaller floating-point type may lose precision or overflow (e.g., to infinity), while converting to a larger type preserves the value exactly (narrowing sketch below).
  • Floating-point numbers can be printed exactly and concisely in hexadecimal with the %a specifier, unlike decimal formats, which may lose precision (the final sketch below demonstrates this).
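
The rounding sketch referenced above is a minimal hand-written C example (not from the article) that assumes an IEEE 754 platform; the printed digits are what a typical libc produces.

```c
/* Rounding sketch: 0.2 has no exact binary representation, so the stored
   value is the nearest representable float/double. */
#include <stdio.h>

int main(void) {
    float  f = 0.2f;
    double d = 0.2;
    printf("%.20f\n", f);  /* e.g. 0.20000000298023223877: nearest binary32 value */
    printf("%.20f\n", d);  /* e.g. 0.20000000000000001110: closer, still not exact */
    return 0;
}
```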
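
The bit-field sketch is likewise an illustrative snippet, not the article's code: it copies the float's bits into an integer and masks out the 1/8/23 layout described above.

```c
/* Bit-field sketch: decompose a 32-bit float into sign, biased exponent,
   and significand by reinterpreting its bit pattern. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 0.2f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);             /* copy the raw bit pattern */

    uint32_t sign        = bits >> 31;          /* 1 bit: 0 positive, 1 negative */
    uint32_t exponent    = (bits >> 23) & 0xFF; /* 8 bits, stored with a +127 bias */
    uint32_t significand = bits & 0x7FFFFF;     /* 23 bits, implicit leading 1 */

    printf("sign=%u exponent=%u (unbiased %d) significand=0x%06X\n",
           sign, exponent, (int)exponent - 127, significand);
    return 0;
}
```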
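
The special-values sketch shows how zeros, infinities, and NaNs arise from arithmetic and how the standard <math.h> macros classify them; again a hand-rolled demo under IEEE 754 assumptions.

```c
/* Special-values sketch: signed zeros, infinities, and NaNs in action. */
#include <stdio.h>
#include <math.h>

int main(void) {
    float zero = 0.0f;
    float neg_zero = -0.0f;
    float inf = 1.0f / zero;   /* division by zero yields +infinity */
    float nan = zero / zero;   /* 0/0 is invalid and yields NaN */

    printf("%d\n", zero == neg_zero);       /* 1: the two zeros compare equal... */
    printf("%d\n", signbit(neg_zero) != 0); /* 1: ...but the sign bit differs */
    printf("%d\n", isinf(inf));             /* 1 */
    printf("%d\n", isnan(nan));             /* 1 */
    printf("%d\n", nan == nan);             /* 0: NaN is unequal even to itself */
    return 0;
}
```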
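
The gradual-underflow sketch is a small made-up example: without subnormals the difference below would flush to zero even though the operands differ.

```c
/* Gradual-underflow sketch: a result smaller than FLT_MIN is still
   representable as a subnormal, so x - y is zero only when x == y. */
#include <stdio.h>
#include <float.h>

int main(void) {
    float x = 1.5f * FLT_MIN;   /* a small normal number */
    float y = 1.0f * FLT_MIN;   /* the smallest normal number */

    float d = x - y;            /* 0.5 * FLT_MIN: below FLT_MIN, hence subnormal */
    printf("%a\n", d);          /* nonzero, e.g. 0x1p-127 */
    printf("%d\n", d == 0.0f);  /* 0: the difference did not vanish */
    return 0;
}
```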
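
The spacing sketch probes the gap to the next representable float with nextafterf and checks the 2^24 limit; the printed gaps are the typical binary32 values.

```c
/* Spacing sketch: gaps between neighboring floats grow with the exponent,
   and past 2^24 not every integer fits. */
#include <stdio.h>
#include <math.h>

int main(void) {
    printf("%g\n", nextafterf(1.0f, 2.0f) - 1.0f);           /* 2^-23, ~1.19e-07 */
    printf("%g\n", nextafterf(1024.0f, 2048.0f) - 1024.0f);  /* 2^-13, ~1.22e-04 */

    float limit = 16777216.0f;             /* 2^24 */
    printf("%d\n", limit + 1.0f == limit); /* 1: 2^24 + 1 rounds back to 2^24 */
    return 0;
}
```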
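
The narrowing sketch converts a double down to a float to show both failure modes, overflow to infinity and loss of significand digits; the specific constants are arbitrary choices for illustration.

```c
/* Narrowing sketch: double -> float can overflow or lose precision,
   while float -> double is always exact. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double huge = 1e60;              /* far above FLT_MAX (~3.4e38) */
    float  f    = (float)huge;       /* no float is close enough: becomes +infinity */
    printf("%d\n", isinf(f));        /* 1 */

    double precise = 0.1;            /* 53-bit significand */
    float  rounded = (float)precise; /* rounded to a 24-bit significand */
    printf("%.20f\n", precise);
    printf("%.20f\n", (double)rounded); /* widening back is exact, digits already lost */
    return 0;
}
```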
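
The final sketch prints the same float with %a and with a decimal format; the hexadecimal form spells out the stored value exactly and can be read back with scanf. The exact text emitted depends on the libc, hence the "e.g." comment.

```c
/* Hexadecimal-output sketch: %a is an exact, concise textual form. */
#include <stdio.h>

int main(void) {
    float f = 0.2f;
    printf("%a\n", f);    /* e.g. 0x1.99999ap-3: the exact stored value */
    printf("%.9g\n", f);  /* nine significant decimal digits round-trip binary32 */

    float g = 0.0f;
    if (sscanf("0x1.99999ap-3", "%a", &g) == 1)  /* %a also parses hex floats */
        printf("%d\n", f == g);                  /* 1: the round-trip is exact */
    return 0;
}
```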