Stronger Normalization-Free Transformers
- #Transformers
- #Machine Learning
- #Deep Learning
- Dynamic Tanh (DyT), an element-wise tanh(αx) with a learnable scale, has shown that Transformers can be trained without normalization layers.
- This work explores point-wise function designs to surpass DyT's performance.
- A large-scale search led to the introduction of Derf(x) = erf(αx + s), which outperforms LayerNorm, RMSNorm, and DyT (a minimal implementation sketch follows this list).
- Derf excels across domains including vision, speech representation, and DNA sequence modeling.
- Its gains are attributed to improved generalization rather than stronger fitting capacity.
- Derf is presented as a practical choice for normalization-free Transformer architectures due to its simplicity and superior performance.
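
Below is a minimal PyTorch sketch of how element-wise layers like DyT and Derf would slot in where LayerNorm normally sits in a Transformer block. The initialization values, the per-channel affine output transform, and the parameter shapes are assumptions made for illustration; they are not specified in the summary above.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Dynamic Tanh: element-wise tanh(alpha * x) with a learnable scalar alpha,
    followed by a per-channel affine transform, used in place of LayerNorm.
    Initial alpha of 0.5 is an assumed default."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias


class Derf(nn.Module):
    """Element-wise erf(alpha * x + s) with learnable scalar scale and shift.
    The affine output transform mirrors the DyT sketch above and is an assumption."""

    def __init__(self, dim: int, init_alpha: float = 0.5, init_shift: float = 0.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))
        self.shift = nn.Parameter(torch.full((1,), init_shift))
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.erf(self.alpha * x + self.shift) + self.bias
```

In use, each `nn.LayerNorm(dim)` in a Transformer block would be swapped for `Derf(dim)` (or `DyT(dim)`), leaving the attention and MLP sublayers unchanged.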