Exposing Floating Point – Bartosz Ciechanowski
- #IEEE-754
- #floating-point
- #numerical-computing
- Floating-point numbers in practice follow the IEEE 754 binary floating-point formats: the types half, float, and double correspond to binary16, binary32, and binary64.
- Floating-point numbers are essentially base-2 scientific notation with a fixed number of significand digits and a limited exponent range, so some values (e.g., 0.2) can only be stored rounded, and values outside the exponent range cannot be represented at all.
- The binary32 encoding of a float uses 1 bit for sign (0 for positive, 1 for negative), 8 bits for exponent (biased by 127), and 23 bits for significand (with an implicit leading 1, except for subnormals).
- Special values include positive/negative zero (exponent 0, significand 0), infinities (exponent all 1s, significand 0), and NaNs (exponent all 1s, significand non-zero); infinities arise from overflow and division by zero, NaNs from invalid operations like 0/0 or inf − inf.
- Subnormals allow representation of numbers smaller than the minimum normal value by using an implicit leading 0 in the significand, though with reduced precision; this gradual underflow guarantees that x - y = 0 only when x = y (without subnormals, the difference of two distinct tiny numbers could flush to zero).
- The distribution of floating-point values is non-uniform, with spacing between representable numbers increasing with exponent; integers up to 2^24 can be represented exactly in a float.
- Conversions between floating-point types may lose precision or cause overflow (e.g., to infinity) when converting to a smaller type, while conversions to a larger type preserve exact values.
- Printing floating-point numbers accurately can be done using hexadecimal format (the %a specifier), which provides an exact and concise representation, unlike typical decimal formats that print too few digits to round-trip the value.