
Stronger Normalization-Free Transformers

2 days ago
  • #Transformers
  • #Machine Learning
  • #Deep Learning
  • Dynamic Tanh (DyT) showed that normalization layers in deep networks can be replaced with simple point-wise functions.
  • This work explores point-wise function designs to surpass DyT's performance.
  • A large-scale search led to the introduction of Derf(x) = erf(αx + s), which outperforms LayerNorm, RMSNorm, and DyT (see the sketch after this list).
  • Derf excels in various domains including vision, speech representation, and DNA sequence modeling.
  • Performance gains of Derf are attributed to improved generalization rather than stronger fitting capacity.
  • Derf is presented as a practical choice for normalization-free Transformer architectures due to its simplicity and superior performance.
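The brief only gives the functional form Derf(x) = erf(αx + s), so the following is a minimal PyTorch sketch of how such a layer could slot in where LayerNorm normally sits. The parameter shapes and the trailing per-channel affine (γ, β) are assumptions borrowed from the DyT convention, not details confirmed by the summary.

```python
import torch
import torch.nn as nn


class Derf(nn.Module):
    """Point-wise drop-in for LayerNorm/RMSNorm: Derf(x) = erf(alpha * x + s).

    The learnable input scale/shift (alpha, s) and the per-channel output
    affine (gamma, beta) follow the DyT-style parameterization; these shapes
    are assumptions for illustration.
    """

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable input scale α
        self.shift = nn.Parameter(torch.zeros(1))                # learnable input shift s
        self.gamma = nn.Parameter(torch.ones(dim))               # per-channel output scale
        self.beta = nn.Parameter(torch.zeros(dim))               # per-channel output shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise squashing via the error function; unlike LayerNorm,
        # no per-token mean/variance statistics are computed.
        return self.gamma * torch.erf(self.alpha * x + self.shift) + self.beta
```

In use, the module would replace each normalization layer in a Transformer block (e.g. `self.norm1 = Derf(d_model)` instead of `nn.LayerNorm(d_model)`), keeping the rest of the architecture unchanged.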