Better Activation Functions for NNUE
- #NNUE
- #Activation Functions
- #Deep Learning
- Experimented with replacing the SCReLU activations in Viridithas's NNUE with Swish in layers L₁ and L₂.
- Encountered teething problems with Hard-Swish, which lowered sparsity in the L₀ output activations and hurt runtime performance.
- Solved the sparsity issue by adding a regularization term to the loss function that penalizes dense activations.
- Swish networks showed a smoother evaluation scale and significant Elo gains over the SCReLU baselines.
- Further strength improvements were achieved by replacing Swish with SwiGLU in L₂.
- Final activation sequence in Viridithas resembles smooth versions of CReLU and SCReLU, similar to findings in PlentyChess.
- Author expresses enthusiasm for integrating more deep learning techniques into NNUE design, hinting at future explorations.
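For reference, the activations discussed above have standard closed forms. A minimal sketch in Python, using the commonly cited definitions (these are the textbook formulas, not code from Viridithas's trainer):

```python
import math

def screlu(x: float) -> float:
    # Squared Clipped ReLU: clamp to [0, 1], then square.
    return min(max(x, 0.0), 1.0) ** 2

def swish(x: float) -> float:
    # Swish (also called SiLU): x * sigmoid(x).
    # Smooth everywhere, slightly non-monotonic just below zero.
    return x / (1.0 + math.exp(-x))

def hard_swish(x: float) -> float:
    # Hard-Swish: piecewise-linear approximation of Swish,
    # cheaper to evaluate with integer/SIMD arithmetic.
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0
```

Note that SCReLU saturates exactly at 0 and 1, while Swish and Hard-Swish are unbounded above, which is one reason swapping them changes the evaluation scale.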
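The regularization fix for the sparsity problem can be sketched as an activation penalty added to the training loss. This is a hypothetical illustration: the L1 penalty form, the `coeff` value, and the function names are assumptions, not Viridithas's actual trainer code.

```python
def activation_penalty(activations, coeff=1e-4):
    # Assumed L1-style penalty: pushes pre-output activations toward
    # zero, encouraging the sparse L0 outputs that fast inference
    # paths rely on.
    return coeff * sum(abs(a) for a in activations) / len(activations)

def total_loss(eval_loss, activations, coeff=1e-4):
    # Dense activations raise the loss, so the optimizer trades a
    # little evaluation accuracy for sparser hidden layers.
    return eval_loss + activation_penalty(activations, coeff)
```

The design intuition: the engine's inference code skips zero activations, so a small, uniform pressure toward zero at training time buys back speed at search time.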
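SwiGLU, mentioned as the final L₂ replacement, gates one linear projection of the input with a Swish of another. A minimal element-wise sketch, assuming the usual GLU formulation (the two projections are passed in pre-activated; shapes and names are illustrative):

```python
import math

def swish(x: float) -> float:
    # Swish (SiLU): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def swiglu(gate: list, value: list) -> list:
    # SwiGLU: swish(W.x) elementwise-multiplied with (V.x), where
    # `gate` = W.x and `value` = V.x are two linear projections of
    # the same input vector.
    return [swish(g) * v for g, v in zip(gate, value)]
```

Because the gate saturates toward the identity for large positive inputs and toward zero for large negative ones, SwiGLU behaves like a learned, smooth on/off switch per neuron, which fits the "smooth versions of CReLU/SCReLU" observation above.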