Transformers Without Normalization
- #Transformers
- #Machine Learning
- #Normalization
- Normalization layers are commonly used in modern neural networks but may not be essential.
- Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(αx) with a learnable scalar α, is introduced as a simple drop-in replacement for normalization layers in Transformers (see the sketch after this list).
- DyT is inspired by the observation that layer normalization in trained Transformers often produces tanh-like, S-shaped input-output mappings.
- Transformers with DyT can match or exceed the performance of their normalized counterparts.
- The effectiveness of DyT is validated across diverse settings, spanning recognition and generation tasks as well as supervised and self-supervised learning paradigms.
- The findings challenge the conventional belief that normalization layers are indispensable in neural networks.
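Below is a minimal PyTorch sketch of DyT, assuming the element-wise form reported in the paper: γ · tanh(αx) + β, where α is a learnable scalar and γ, β are learnable per-channel vectors. The `init_alpha` default of 0.5 follows the value the paper reports for most settings, but the class interface here is an illustrative choice, not the authors' reference code.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Dynamic Tanh: a drop-in replacement for LayerNorm in Transformers.

    Computes gamma * tanh(alpha * x) + beta over the last dimension,
    where alpha is a learnable scalar and gamma/beta are learnable
    per-channel vectors (mirroring LayerNorm's affine parameters).
    """

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        # Learnable scalar controlling how aggressively tanh squashes inputs.
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))
        # Per-channel affine scale and shift, as in LayerNorm.
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, no per-token statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


# Usage: swap nn.LayerNorm(dim) for DyT(dim) inside a Transformer block.
layer = DyT(dim=768)
out = layer(torch.randn(2, 16, 768))  # broadcasting matches LayerNorm's last-dim convention
```

Because DyT avoids computing per-token mean and variance, it removes the reduction step that normalization layers require, which is part of the paper's argument that normalization may not be indispensable.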