Core Concepts
A novel adaptive activation function with an even cubic nonlinearity is introduced that improves neural network accuracy without requiring substantial additional computational resources, while exhibiting a tradeoff between convergence and accuracy.
Abstract
The paper introduces a novel adaptive activation function that combines the standard ReLU function with an additional even cubic nonlinearity term. The activation function is parameterized by two layer-dependent coefficients that are optimized during training.
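To make the construction concrete, the sketch below shows one way such an activation could be implemented as a PyTorch module. The exact functional form, the coefficient names a and b, their initial values, and the default value of the global constant are assumptions for illustration only; this summary does not reproduce the paper's equations.

```python
# Illustrative sketch only: the functional form below is an assumed stand-in
# for the paper's activation, combining ReLU with an even cubic term |x|^3
# scaled by a global constant gamma and a trainable, layer-dependent coefficient.
import torch
import torch.nn as nn

GAMMA = 1.0  # global constant controlling the strength of the cubic term (assumed default)

class AdaptiveCubicActivation(nn.Module):
    """ReLU plus an even cubic nonlinearity, with two trainable,
    layer-dependent coefficients a and b (assumed form)."""

    def __init__(self, gamma: float = GAMMA):
        super().__init__()
        self.gamma = gamma
        # Layer-dependent coefficients, optimized jointly with the network weights.
        self.a = nn.Parameter(torch.tensor(1.0))   # scales the ReLU part
        self.b = nn.Parameter(torch.tensor(0.01))  # scales the even cubic part

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # |x|^3 is an even function, so the added term is symmetric in x,
        # while the ReLU part retains its original shape when a is near 1 and b is small.
        return self.a * torch.relu(x) + self.gamma * self.b * torch.abs(x) ** 3
```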
The key highlights and insights are:
The proposed activation function preserves the underlying features of the ReLU function while improving the network's predictive accuracy. This is achieved by introducing additional degrees of freedom through optimizable, layer-dependent parameters that allow the degree of nonlinearity to be adjusted during training.
Numerical experiments on the MNIST dataset show that the adaptive activation function outperforms the standard ReLU and swish activation functions in terms of predictive accuracy. However, this improvement comes at the cost of a larger number of non-converged results during training (an experimental sketch follows this list).
The tradeoff between convergence and accuracy is explored by adjusting the strength of the cubic term through a global constant, γ. Increasing γ leads to more accurate solutions but also a greater number of non-converged results, which can be rapidly identified and discarded without significantly affecting the overall computation time.
The results suggest that analytic activation functions, such as the swish function, yield smoother distributions of neural network predictions compared to the proposed activation function.
The improvement afforded by the proposed activation function is attributed to the presence of both odd and even terms in the function, indicating that maximally effective adaptive functions may require separately adjustable even and odd components.
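As a rough illustration of the experimental protocol referenced above, the sketch below trains several independent realizations of a 512/50/10 dense network on MNIST with the adaptive activation (reusing the AdaptiveCubicActivation class sketched earlier) and then discards non-converged runs. The flattened 784-pixel input, optimizer, epoch count, number of realizations, and the accuracy cutoff used to flag non-converged runs are all assumptions, not the paper's settings.

```python
# Sketch of repeated-realization training on MNIST with the adaptive activation.
# Assumes AdaptiveCubicActivation from the earlier sketch is in scope.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_model(gamma: float) -> nn.Sequential:
    # 512/50/10 dense network; the 28*28 flattened MNIST input is assumed.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 512), AdaptiveCubicActivation(gamma),
        nn.Linear(512, 50),      AdaptiveCubicActivation(gamma),
        nn.Linear(50, 10),
    )

def run_realization(gamma: float, epochs: int = 5) -> float:
    """Train one independent realization and return its final test accuracy."""
    tfm = transforms.ToTensor()
    train = DataLoader(datasets.MNIST("data", train=True, download=True, transform=tfm),
                       batch_size=128, shuffle=True)
    test = DataLoader(datasets.MNIST("data", train=False, transform=tfm), batch_size=1024)
    model = make_model(gamma)
    opt = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in test:
            correct += (model(x).argmax(dim=1) == y).sum().item()
    return correct / len(test.dataset)

if __name__ == "__main__":
    # Repeat over many realizations (10 here for brevity; the paper reports 60-180).
    accuracies = [run_realization(gamma=5.0) for _ in range(10)]
    # Non-converged runs are identified quickly and discarded; a simple accuracy
    # cutoff stands in for whatever convergence criterion the paper actually uses.
    converged = [acc for acc in accuracies if acc > 0.90]
```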
Stats
The paper's key figures are captioned as follows:
"The test accuracy during and after 60 separate calculations for a 512/50/10 dense neural network with a relu activation function."
"As in Figure 1 but for a swish activation function."
"As in Figure 1 but for the activation function introduced in this paper with Ξ³= 5 and 150 realizations."
"The histogram of Figure 3 but for Ξ³= 1 (left plot) and Ξ³= 2.5 (right plot) and 60 realizations."
"The histogram of Figure 3 with Ξ³= 1 and without absolute value sign in the cubic term in the activation function for 180 realizations."