The paper introduces a novel adaptive activation function that combines the standard ReLU function with an additional even cubic nonlinearity term. The activation function is parameterized by two layer-dependent coefficients that are optimized during training.
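As a rough illustration of the construction described above, the following Python sketch adds an even cubic term to ReLU. The exact functional form is not given in this summary, so the expression `a * relu(x) + gamma * b * abs(x)**3`, the class name `AdaptiveActivation`, and the parameter names `a`, `b`, and `gamma` are all assumptions made for illustration. In a real network the two coefficients would be trainable per-layer parameters rather than fixed floats.

```python
from dataclasses import dataclass


def relu(x: float) -> float:
    """Standard ReLU: max(0, x)."""
    return max(0.0, x)


@dataclass
class AdaptiveActivation:
    """Hypothetical per-layer activation: ReLU plus an even cubic term.

    Assumed form (not taken from the paper):
        a * relu(x) + gamma * b * |x|**3
    """

    a: float = 1.0      # assumed coefficient on the ReLU term (trainable in practice)
    b: float = 0.0      # assumed coefficient on the even cubic term (trainable in practice)
    gamma: float = 1.0  # assumed global strength of the cubic term

    def __call__(self, x: float) -> float:
        # |x|**3 is even in x, so it contributes equally for positive and negative inputs.
        return self.a * relu(x) + self.gamma * self.b * abs(x) ** 3
```

With `b = 0` the sketch reduces to plain ReLU, so the default `AdaptiveActivation()` maps 2.0 to 2.0 and any negative input to 0.0. A nonzero `b` makes the output for negative inputs nonzero as well.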
The key highlights and insights are:
The proposed activation function preserves the underlying features of the ReLU function while improving the accuracy of the network. This is achieved through the additional degrees of freedom supplied by the optimizable parameters, which allow the degree of nonlinearity to be adjusted.
Numerical experiments on the MNIST dataset show that the adaptive activation function outperforms the standard ReLU and swish activation functions in terms of predictive accuracy. However, this improvement comes at the cost of a higher number of non-converged results during training.
The tradeoff between convergence and accuracy is explored by adjusting the strength of the cubic term through a global constant, γ. Increasing γ leads to more accurate solutions but also a greater number of non-converged results, which can be rapidly identified and discarded without significantly affecting the overall computation time.
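The screening step mentioned above could look roughly like the following Python sketch. The summary does not say how non-converged results are recognized, so the criterion used here (accuracy below a threshold at an early epoch), the function name `screen_runs`, and the default values of `check_epoch` and `min_accuracy` are all hypothetical.

```python
from typing import Dict, List


def screen_runs(
    accuracy_histories: Dict[str, List[float]],
    check_epoch: int = 1,
    min_accuracy: float = 0.5,
) -> Dict[str, List[float]]:
    """Keep only runs whose accuracy at `check_epoch` reaches `min_accuracy`.

    Assumption: a non-converged run reveals itself early through low accuracy.
    Both the criterion and the default thresholds are hypothetical.
    """
    kept: Dict[str, List[float]] = {}
    for run_id, history in accuracy_histories.items():
        # Runs too short to be checked are treated as not converged and dropped.
        if len(history) > check_epoch and history[check_epoch] >= min_accuracy:
            kept[run_id] = history
    return kept
```

Because the check only reads one early entry per run, a failed run can be abandoned after a small fraction of the full training budget, which is consistent with the claim that discarding such runs adds little to the overall computation time.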
The results suggest that analytic activation functions, such as the swish function, yield smoother distributions of neural network predictions than the proposed activation function does.
The improvement afforded by the proposed activation function is attributed to the presence of both odd and even terms in the function, indicating that maximally effective adaptive functions may require separately adjustable even and odd components.
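One way to read the remark about even and odd terms: any function splits into an even part, [f(x) + f(-x)] / 2, and an odd part, [f(x) - f(-x)] / 2. For ReLU these are |x|/2 and x/2, so plain ReLU already contains both, but in a fixed ratio. The Python sketch below checks this numerically. The helper name `even_odd_parts` and the choice of `abs(x)**3` as the even cubic term are assumptions for illustration.

```python
from typing import Callable, Tuple


def even_odd_parts(f: Callable[[float], float], x: float) -> Tuple[float, float]:
    """Return (even part, odd part) of f evaluated at x."""
    even = 0.5 * (f(x) + f(-x))
    odd = 0.5 * (f(x) - f(-x))
    return even, odd


def relu(x: float) -> float:
    """Standard ReLU: max(0, x)."""
    return max(0.0, x)


def even_cubic(x: float) -> float:
    """Assumed form of an even cubic term: |x|**3."""
    return abs(x) ** 3


if __name__ == "__main__":
    for x in (-2.0, 0.5, 3.0):
        print(x, even_odd_parts(relu, x), even_odd_parts(even_cubic, x))
```

Since `abs(x)**3` has no odd part, giving it its own coefficient changes the even content of the activation without touching the odd content, which is the kind of separate adjustability the last point refers to.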