Transformers Without Normalization
- #Transformers
- #Machine Learning
- #Normalization
- Normalization layers are commonly used in modern neural networks but may not be essential.
- Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(αx) with a learnable scalar α, is introduced as a simple drop-in replacement for normalization layers in Transformers (see the sketch after this list).
- DyT is inspired by the observation that layer normalization in trained Transformers often produces tanh-like, S-shaped input-output mappings.
- Transformers with DyT can match or exceed the performance of their normalized counterparts.
- The effectiveness of DyT is validated across diverse settings, spanning recognition and generation tasks as well as supervised and self-supervised learning paradigms.
- The findings challenge the conventional belief that normalization layers are indispensable in neural networks.
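Below is a minimal PyTorch sketch of DyT, assuming the element-wise form reported in the paper: γ · tanh(αx) + β, where α is a learnable scalar and γ, β are learnable per-channel vectors. The `init_alpha` default of 0.5 follows the value the paper reports for most settings, but the class interface here is an illustrative choice, not the authors' reference code.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Dynamic Tanh: a drop-in replacement for LayerNorm in Transformers.

    Computes gamma * tanh(alpha * x) + beta over the last dimension,
    where alpha is a learnable scalar and gamma/beta are learnable
    per-channel vectors (mirroring LayerNorm's affine parameters).
    """

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        # Learnable scalar controlling how aggressively tanh squashes inputs.
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))
        # Per-channel affine scale and shift, as in LayerNorm.
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, no per-token statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


# Usage: swap nn.LayerNorm(dim) for DyT(dim) inside a Transformer block.
layer = DyT(dim=768)
out = layer(torch.randn(2, 16, 768))  # broadcasting matches LayerNorm's last-dim convention
```

Because DyT avoids computing per-token mean and variance, it removes the reduction step that normalization layers require, which is part of the paper's argument that normalization may not be indispensable.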