Hasty Briefs

Transformers Without Normalization

9 months ago
  • #Transformers
  • #Machine Learning
  • #Normalization
  • Normalization layers are commonly used in modern neural networks but may not be essential.
  • Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(αx) with a learnable scalar α, is introduced as a simple drop-in replacement for normalization layers in Transformers (see the sketch after this list).
  • DyT is inspired by the observation that layer normalization in trained Transformers often produces tanh-like, S-shaped input-output mappings.
  • Transformers with DyT can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.
  • The effectiveness of DyT is validated across diverse settings, from recognition to generation and from supervised to self-supervised learning.
  • The findings challenge the conventional belief that normalization layers are indispensable in neural networks.
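
Concretely, DyT replaces each normalization layer with an element-wise squashing tanh(αx), followed by the same learnable per-channel scale and shift that LayerNorm applies, while computing no activation statistics at all. Below is a minimal PyTorch sketch of this formulation; the module structure and the α₀ = 0.5 default follow the paper's description, but treat the details as illustrative rather than as the official implementation.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: DyT(x) = gamma * tanh(alpha * x) + beta.

    Intended as a drop-in replacement for LayerNorm/RMSNorm in a
    Transformer block: a learnable scalar alpha squashes activations
    through tanh, then per-channel affine parameters rescale the
    result, mirroring LayerNorm's affine step. Unlike normalization,
    no mean/variance statistics are computed.
    """
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))   # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))   # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

# Usage: swap nn.LayerNorm(dim) for DyT(dim) inside a Transformer block.
x = torch.randn(8, 16, 256)   # (batch, tokens, dim)
layer = DyT(dim=256)
print(layer(x).shape)         # torch.Size([8, 16, 256])
```

Because the tanh saturates extreme activations while behaving almost linearly near zero, it mimics the squashing effect that layer normalization was observed to produce, which is the intuition behind the substitution.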