Core Concepts
This paper introduces PolyCom, a novel family of polynomial composition activation functions designed to enhance the performance of large language models (LLMs). By composing polynomials with conventional activations, PolyCom lets transformers capture higher-order interactions in the data, improving accuracy and accelerating convergence relative to traditional activation functions.
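To make the idea concrete, below is a minimal PyTorch sketch of a PolyNorm-style activation (one PolyCom instantiation mentioned in the Stats). It assumes a form that mixes normalized elementwise powers of the input with learnable coefficients; the order, normalization, and initialization here are illustrative assumptions, not the paper's exact definition.

```python
import torch
import torch.nn as nn

class PolyNormSketch(nn.Module):
    """Illustrative sketch of a PolyNorm-style composition activation.

    Assumed form: a_0 + sum_i a_i * norm(x**i), where powers are elementwise
    and norm() rescales each power to unit RMS over the feature dimension.
    This is a sketch of the general idea, not a verbatim reimplementation.
    """

    def __init__(self, order: int = 3):
        super().__init__()
        # Learnable mixing coefficients a_1..a_order plus a bias term a_0 (assumed init).
        self.weights = nn.Parameter(torch.ones(order) / order)
        self.bias = nn.Parameter(torch.zeros(1))
        self.order = order

    @staticmethod
    def _norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # RMS-normalize over the last (feature) dimension to keep powers from exploding.
        return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.bias
        for i in range(1, self.order + 1):
            # The elementwise i-th power is what captures i-th order interactions.
            out = out + self.weights[i - 1] * self._norm(x ** i)
        return out

# Usage: a drop-in replacement for the activation inside a transformer MLP block.
act = PolyNormSketch(order=3)
y = act(torch.randn(2, 16, 512))  # (batch, seq_len, hidden)
print(y.shape)                    # torch.Size([2, 16, 512])
```

The normalization of each power is the key design choice this sketch highlights: it is what keeps higher-order terms numerically stable, in contrast to the exploding values the authors attribute to plain polynomial activations.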
Stats
PolyNorm outperforms SwiGLU by an average margin of 1.21% across six downstream tasks on the 1B-parameter dense model.
PolyNorm consistently achieves lower validation losses than SwiGLU across all eleven datasets in the MoE model, with an average improvement of 0.02.
Quotes
"In this paper, we introduce a novel category of polynomial composition activation functions (PolyCom), specifically engineered to enhance the performance of transformer architectures."
"Unlike other forms of polynomials [5, 38] that suffer from inadequate approximation, exploding values, and oscillatory behavior, we demonstrate that PolyCom possesses a more potent expressive capability than both ReLU and traditional polynomials and achieves optimal approximation within Sobolev space."
"The results indicate that the proposed method not only improves model accuracy, but also accelerates convergence rates, thereby suggesting that polynomial composition activations provide a substantive advantage in deep learning applications."