Bias Compounds, Variance Washes Out

3 days ago
  • #stochastic rounding
  • #numerical error
  • #machine learning optimization
  • Round-to-nearest makes the same signed error on every similar update, so the error compounds into a systematic bias.
  • Stochastic rounding introduces zero-mean random errors that partly cancel over time, so no systematic bias accumulates.
  • In an example that adds 0.001 to 1.0 a thousand times in BF16, round-to-nearest stays at 1.0, because 0.001 is less than half the BF16 spacing near 1.0 (2^-7 ≈ 0.0078), while stochastic rounding reaches 2.0 in expectation.
  • Biased errors grow linearly (O(n)), while unbiased errors grow more slowly (O(√n)), making stochastic rounding advantageous for long runs of small updates.
  • An experiment training an MLP on a teacher-student task shows that BF16 with stochastic rounding matches FP32 performance while using less memory, whereas BF16 with round-to-nearest plateaus.
  • Stochastic rounding adds no extra memory or bandwidth overhead but removes bias, allowing six bytes (BF16) to match ten bytes (FP32) in optimizer state.
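
The BF16 accumulation example above can be reproduced with a short simulation. This is a minimal sketch, not the article's code: it emulates BF16 by keeping the top 16 bits of a float32 bit pattern, and the helper names (`bf16_round_nearest`, `bf16_round_stochastic`, `accumulate`) are hypothetical. It assumes finite inputs and ignores NaN and overflow handling.

```python
import random
import struct


def _f32_bits(x: float) -> int:
    """Return the bit pattern of x after conversion to IEEE float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_round_nearest(x: float) -> float:
    """Round to the nearest BF16 value (ties to even).

    BF16 is emulated as a float32 whose low 16 bits are zero.
    """
    bits = _f32_bits(x)
    lsb = (bits >> 16) & 1  # lowest bit that survives, used for ties-to-even
    return _bits_f32((bits + 0x7FFF + lsb) & 0xFFFF0000)


def bf16_round_stochastic(x: float, rng: random.Random) -> float:
    """Round to a neighbouring BF16 value, unbiased in expectation.

    Adding a uniform 16-bit integer before truncating rounds away from zero
    with probability equal to the discarded fraction.
    """
    bits = _f32_bits(x)
    return _bits_f32((bits + rng.getrandbits(16)) & 0xFFFF0000)


def accumulate(round_fn, steps: int = 1000, start: float = 1.0,
               delta: float = 0.001) -> float:
    """Add delta to start `steps` times, rounding to BF16 after every add."""
    x = round_fn(start)
    for _ in range(steps):
        x = round_fn(x + delta)
    return x


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    nearest = accumulate(bf16_round_nearest)
    runs = [accumulate(lambda v: bf16_round_stochastic(v, rng))
            for _ in range(200)]
    print("round-to-nearest:", nearest)
    print("stochastic, mean of 200 runs:", sum(runs) / len(runs))
```

Round-to-nearest never leaves 1.0, so its error after 1000 steps is the full 1.0 (linear in the number of updates). Individual stochastic runs scatter around 2.0; under the simplifying assumption that the spacing stays at 2^-7, the standard deviation of a single run is roughly 0.08, which reflects the √n growth of the unbiased error.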