Bias Compounds, Variance Washes Out

a month ago

Round-to-nearest makes the same error in the same direction on every update, so the error compounds into a systematic bias.
Stochastic rounding introduces zero-mean random errors that partly cancel over time, so the accumulated result is unbiased in expectation.
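In an example that adds 0.001 to 1.0 a thousand times in BF16, round-to-nearest stays at 1.0, because 0.001 is less than half the BF16 spacing near 1.0 (2^-7, about 0.0078) and every update is rounded away, while stochastic rounding reaches 2.0 in expectation.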
Biased errors grow linearly in the number of updates n, as O(n), while zero-mean errors grow only as O(√n), which makes stochastic rounding advantageous for long runs of small updates.
An experiment training an MLP on a teacher-student task shows BF16 with stochastic rounding matches FP32 performance with less memory, while round-to-nearest plateaus.
Stochastic rounding adds no extra memory or bandwidth overhead but removes bias, allowing six bytes (BF16) to match ten bytes (FP32) in optimizer state.
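
The accumulation example above can be reproduced with a short NumPy simulation. This is a minimal sketch, not the original article's code: it assumes BF16 can be emulated by keeping the top 16 bits of a float32, and the helper names (round_nearest_bf16, round_stochastic_bf16, accumulate), the trial count, and the seed are all illustrative choices.

```python
import numpy as np

# Assumption: BF16 is emulated as the top 16 bits of a float32 bit pattern.
MASK = np.uint32(0xFFFF0000)

def round_nearest_bf16(x):
    """Round float32 values to BF16 precision (round-to-nearest-even)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)  # lowest kept bit, breaks ties
    return ((bits + np.uint32(0x7FFF) + lsb) & MASK).view(np.float32)

def round_stochastic_bf16(x, rng):
    """Round to BF16 precision, rounding up with probability equal to the
    fraction of a BF16 step that the discarded bits represent."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    return ((bits + noise) & MASK).view(np.float32)

def accumulate(round_fn, start=1.0, step=0.001, n_steps=1000, n_trials=4096):
    """Add `step` to `start` n_steps times, rounding to BF16 after each add.
    Runs n_trials independent accumulators in parallel."""
    acc = np.full(n_trials, start, dtype=np.float32)
    inc = np.float32(step)
    for _ in range(n_steps):
        acc = round_fn(acc + inc)
    return acc

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
rtn = accumulate(round_nearest_bf16)
sr = accumulate(lambda v: round_stochastic_bf16(v, rng))

print("round-to-nearest final value:", float(rtn[0]))
print("stochastic rounding mean:    ", float(sr.mean()))
print("stochastic rounding std:     ", float(sr.std()))
```

Under these assumptions, every round-to-nearest accumulator should stay at 1.0, while the stochastic-rounding accumulators should average close to 2.0 with a trial-to-trial spread that is small compared with the total of the updates. Any single stochastic run is still noisy; only the mean is unbiased.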
