Better Activation Functions for NNUE
- #NNUE
- #Activation Functions
- #Deep Learning
- Experimented with replacing the SCReLU activations in Viridithas's NNUE with Swish in layers L₁ and L₂.
- Encountered teething problems with Hard-Swish, which lowered sparsity in the L₀ output activations and hurt runtime performance.
- Solved the sparsity issue by adding a regularization term to the loss function that penalizes dense activations.
- Swish networks showed a smoother evaluation scale and significant Elo gains over the SCReLU baselines.
- Further strength improvements were achieved by replacing Swish with SwiGLU in L₂.
- Final activation sequence in Viridithas resembles smooth versions of CReLU and SCReLU, similar to findings in PlentyChess.
- Author expresses enthusiasm for integrating more deep learning techniques into NNUE design, hinting at future explorations.
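For reference, the activations discussed above have standard closed forms. A minimal sketch in Python, using the commonly cited definitions (these are the textbook formulas, not code from Viridithas's trainer):

```python
import math

def screlu(x: float) -> float:
    # Squared Clipped ReLU: clamp to [0, 1], then square.
    return min(max(x, 0.0), 1.0) ** 2

def swish(x: float) -> float:
    # Swish (also called SiLU): x * sigmoid(x).
    # Smooth everywhere, slightly non-monotonic just below zero.
    return x / (1.0 + math.exp(-x))

def hard_swish(x: float) -> float:
    # Hard-Swish: piecewise-linear approximation of Swish,
    # cheaper to evaluate with integer/SIMD arithmetic.
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0
```

Note that SCReLU saturates exactly at 0 and 1, while Swish and Hard-Swish are unbounded above, which is one reason swapping them changes the evaluation scale.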
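The regularization fix for the sparsity problem can be sketched as an activation penalty added to the training loss. This is a hypothetical illustration: the L1 penalty form, the `coeff` value, and the function names are assumptions, not Viridithas's actual trainer code.

```python
def activation_penalty(activations, coeff=1e-4):
    # Assumed L1-style penalty: pushes pre-output activations toward
    # zero, encouraging the sparse L0 outputs that fast inference
    # paths rely on.
    return coeff * sum(abs(a) for a in activations) / len(activations)

def total_loss(eval_loss, activations, coeff=1e-4):
    # Dense activations raise the loss, so the optimizer trades a
    # little evaluation accuracy for sparser hidden layers.
    return eval_loss + activation_penalty(activations, coeff)
```

The design intuition: the engine's inference code skips zero activations, so a small, uniform pressure toward zero at training time buys back speed at search time.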
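SwiGLU, mentioned as the final L₂ replacement, gates one linear projection of the input with a Swish of another. A minimal element-wise sketch, assuming the usual GLU formulation (the two projections are passed in pre-activated; shapes and names are illustrative):

```python
import math

def swish(x: float) -> float:
    # Swish (SiLU): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def swiglu(gate: list, value: list) -> list:
    # SwiGLU: swish(W.x) elementwise-multiplied with (V.x), where
    # `gate` = W.x and `value` = V.x are two linear projections of
    # the same input vector.
    return [swish(g) * v for g, v in zip(gate, value)]
```

Because the gate saturates toward the identity for large positive inputs and toward zero for large negative ones, SwiGLU behaves like a learned, smooth on/off switch per neuron, which fits the "smooth versions of CReLU/SCReLU" observation above.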