
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

12 hours ago
  • #flash attention
  • #low-precision training
  • #transformer
  • Low-precision training of transformer models with flash attention leads to catastrophic loss explosions.
  • The failure is caused by low-rank representations and biased rounding errors in low-precision arithmetic.
  • A minimal modification to flash attention mitigates the bias in the rounding errors and stabilizes training (see the sketch after this list).
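
The sketch below is a hedged illustration, not the paper's kernel or its actual fix. It simulates flash attention's online-softmax accumulation for a single query row, once with a bfloat16 accumulator and once with an fp32 one, under the assumption that repeated rounding of the rescaling step is where the bias enters. The function name `online_softmax_weighted_sum`, the near-identical-logit setup, and the choice to vary only the accumulator precision are illustrative simplifications of where a real kernel uses low precision.

```python
# Minimal sketch: stream softmax(scores) @ values the way flash attention's
# online softmax does, comparing a bfloat16 accumulator against an fp32 one.
# This is an assumption-laden toy, not the paper's proposed modification.
import torch

torch.manual_seed(0)

def online_softmax_weighted_sum(scores, values, acc_dtype):
    """Streamed softmax(scores) @ values with a given accumulator dtype.

    scores: (T,) attention logits for one query row
    values: (T, d) value vectors
    """
    m = torch.tensor(float("-inf"))                       # running max
    l = torch.zeros((), dtype=acc_dtype)                  # running normalizer
    acc = torch.zeros(values.shape[1], dtype=acc_dtype)   # running weighted sum
    for s, v in zip(scores, values):
        m_new = torch.maximum(m, s)
        # Rescale the previous accumulator; in low precision, the rounding of
        # this step (hypothetically) introduces a systematic drift.
        alpha = torch.exp((m - m_new).to(acc_dtype))
        p = torch.exp((s - m_new).to(acc_dtype))
        l = alpha * l + p
        acc = alpha * acc + p * v.to(acc_dtype)
        m = m_new
    return (acc / l).float()

T, d = 4096, 64
# Near-identical logits: a regime where scores barely vary, loosely mimicking
# the low-rank representations the summary mentions.
scores = 5.0 + 1e-3 * torch.randn(T)
values = torch.randn(T, d)

ref  = online_softmax_weighted_sum(scores, values, torch.float64)
bf16 = online_softmax_weighted_sum(scores, values, torch.bfloat16)
fp32 = online_softmax_weighted_sum(scores, values, torch.float32)

print("bf16 accumulator, max abs error vs fp64 reference:",
      (bf16 - ref).abs().max().item())
print("fp32 accumulator, max abs error vs fp64 reference:",
      (fp32 - ref).abs().max().item())
```

If the hypothesized mechanism holds, the bfloat16 accumulator drifts visibly from the float64 reference while the fp32 accumulator stays close, which is consistent with the kind of small precision-targeted change the summary describes; the paper's actual modification may differ.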