Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
- #flash attention
- #low-precision training
- #transformer
- Low-precision training of transformer models with flash attention leads to catastrophic loss explosions.
- The failure stems from the interplay of low-rank representations and biased (non-zero-mean) rounding errors introduced by low-precision arithmetic inside flash attention (a toy illustration follows this list).
- A minimal modification to flash attention mitigates the bias in rounding errors, stabilizing training.
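Below is a minimal toy sketch, not taken from the paper: the function name `online_softmax_attention`, the block size, and the synthetic near-rank-1 keys are illustrative assumptions. It runs the same flash-attention-style online-softmax accumulation once in float32 and once in bfloat16, so the difference between the two outputs shows where low-precision rounding error enters the computation.

```python
import torch

def online_softmax_attention(q, K, V, dtype, block=128):
    """Single-query, flash-attention-style accumulation: K/V are processed
    block by block while a running max, a running sum of exponentials, and a
    rescaled output accumulator are maintained."""
    q, K, V = q.to(dtype), K.to(dtype), V.to(dtype)
    m = torch.tensor(float("-inf"), dtype=dtype)  # running max of scores
    denom = torch.tensor(0.0, dtype=dtype)        # running sum of exp(score - m)
    acc = torch.zeros(V.shape[-1], dtype=dtype)   # running weighted sum of values
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q                              # attention scores for this block
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)               # rescales the previous accumulator
        p = torch.exp(s - m_new)                   # unnormalized block probabilities
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / denom

torch.manual_seed(0)
n, d = 4096, 64
q = torch.randn(d)
# Nearly rank-1 keys (an assumption standing in for "low-rank representations"):
# similar scores make the per-block rounding errors correlated rather than independent.
K = torch.randn(n, 1) @ torch.randn(1, d) + 0.01 * torch.randn(n, d)
V = torch.randn(n, d)

ref = online_softmax_attention(q, K, V, torch.float32)
low = online_softmax_attention(q, K, V, torch.bfloat16).float()
err = low - ref
print(f"mean error: {err.mean().item():+.3e}   rms error: {err.pow(2).mean().sqrt().item():.3e}")
```

Comparing the mean of the error against its RMS gives a rough sense of whether the low-precision error behaves as a systematic bias rather than zero-mean noise; the paper's actual fix is a modification inside flash attention itself, which this sketch does not reproduce.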