Bias Compounds, Variance Washes Out
3 days ago
- #stochastic rounding
- #numerical error
- #machine learning optimization
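- Round-to-nearest is deterministic: when every update is small relative to the gap between representable values, it makes the same rounding error in the same direction each time, so the error compounds as bias.
- Stochastic rounding rounds up or down at random, with probabilities that give the error zero mean; the errors partly cancel over many updates, leaving variance but no bias.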
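- In an example that adds 0.001 to 1.0 a thousand times in BF16, round-to-nearest stays at 1.0, because each increment is less than half the gap between adjacent BF16 values near 1.0 (2^-7 ≈ 0.0078), while stochastic rounding reaches 2.0 in expectation.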
- Over n updates, biased errors grow linearly (O(n)), while unbiased errors behave like a random walk and grow more slowly (O(√n)), making stochastic rounding advantageous for long runs of small updates.
- An experiment training an MLP on a teacher-student task shows BF16 with stochastic rounding matches FP32 performance with less memory, while round-to-nearest plateaus.
- Stochastic rounding adds no extra memory or bandwidth overhead but removes bias, allowing six bytes (BF16) to match ten bytes (FP32) in optimizer state.
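
The accumulation example above can be reproduced with a small simulation. What follows is a minimal sketch, not the post's own code: it emulates BF16 in pure Python by treating a BF16 value as the top 16 bits of an FP32 bit pattern, and the helper names (`bf16_round_nearest`, `bf16_round_stochastic`, `accumulate`) are hypothetical. It assumes finite inputs and does not handle NaN or overflow.

```python
import random
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Return the float whose single-precision bit pattern is b."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def bf16_round_nearest(x: float) -> float:
    """Round to BF16 with round-to-nearest-even (emulated, finite inputs only)."""
    b = f32_bits(x)
    # Add just under half of the discarded range, plus the kept LSB to break ties to even.
    b += 0x7FFF + ((b >> 16) & 1)
    return bits_f32(b & 0xFFFF0000)


def bf16_round_stochastic(x: float, rng: random.Random) -> float:
    """Round to BF16, rounding up with probability equal to the discarded fraction."""
    b = f32_bits(x)
    # Uniform 16-bit noise carries into the kept bits with probability r / 2**16,
    # where r is the value of the 16 discarded bits.
    b += rng.getrandbits(16)
    return bits_f32(b & 0xFFFF0000)


def accumulate(round_fn, start: float = 1.0, step: float = 0.001, n: int = 1000) -> float:
    """Add `step` to `start` n times, rounding the running sum to BF16 after each add."""
    acc = round_fn(start)
    for _ in range(n):
        acc = round_fn(acc + step)
    return acc


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable

    # Round-to-nearest: every increment is rounded away, so the sum never moves.
    print("round-to-nearest:", accumulate(bf16_round_nearest))

    # Stochastic rounding: single runs are noisy, but the average is close to 2.0.
    trials = [accumulate(lambda v: bf16_round_stochastic(v, rng)) for _ in range(200)]
    print("stochastic, mean of 200 runs:", sum(trials) / len(trials))
```

Any single stochastic run lands on a BF16 grid point somewhere near 2.0 rather than exactly on it; only the average over runs is unbiased, which is the sense in which the variance "washes out" while the round-to-nearest bias does not.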