The paper proposes HGRN2, an extension of the previously introduced Hierarchically Gated Linear Recurrent Neural Network (HGRN). The key contribution is a simple outer-product-based state expansion mechanism that significantly increases the recurrent state size without adding any parameters.
The authors first explore structured matrices to directly expand the state size in a parameter-efficient manner, but find that these methods suffer from training inefficiency due to high-dimensional element-wise operations.
Inspired by linear attention, the authors then introduce HGRN2, which replaces the element-wise gating operations with matrix/vector operations, enabling efficient training via hardware-efficient kernels. The resulting recurrent computation resembles that of Gated Linear Attention (GLA), but is derived from a different starting point: the gated linear RNN framework.
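To make the mechanism concrete, below is a minimal sketch of the kind of recurrence described above, assuming the GLA-like form in which the state becomes a d×d matrix, (1 − f_t) plays the role of the "key", and a query vector reads the state out. The function name, shapes, and gate parameterization are illustrative assumptions, not the authors' code, and the sequential Python loop stands in for the chunk-parallel kernels used in practice.

```python
# Sketch of an HGRN2-style recurrence (assumed form, for illustration only).
#   HGRN  : h_t = f_t * h_{t-1} + (1 - f_t) * i_t                  (element-wise, state size d)
#   HGRN2 : S_t = diag(f_t) @ S_{t-1} + outer(1 - f_t, i_t)        (outer-product expansion, state size d*d)
#           o_t = S_t^T @ q_t                                      (query-style readout)
# The gate f_t and input i_t are vectors HGRN already computes, so the
# expanded state adds no parameters.

import numpy as np

def hgrn2_recurrence(q, f, i):
    """Naive sequential form of the assumed HGRN2 recurrence.

    q, f, i: arrays of shape (T, d) -- query, forget gate in (0, 1), and input.
    Returns outputs of shape (T, d).
    """
    T, d = q.shape
    S = np.zeros((d, d))             # expanded matrix-valued recurrent state
    outputs = np.empty((T, d))
    for t in range(T):
        # Decay the previous state row-wise (diag(f_t) @ S) and add the rank-1 update.
        S = f[t][:, None] * S + np.outer(1.0 - f[t], i[t])
        # Read out by contracting the state with the query vector.
        outputs[t] = S.T @ q[t]
    return outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 8, 16
    q = rng.standard_normal((T, d))
    f = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, d))))  # sigmoid forget gate
    i = rng.standard_normal((T, d))
    print(hgrn2_recurrence(q, f, i).shape)  # (8, 16)
```

Reusing (1 − f_t) as the key is what keeps the parameter count identical to HGRN while expanding the state; because the update matches the GLA recurrence structurally, training can reuse GLA-style chunkwise kernels instead of the loop shown here.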
The authors evaluate HGRN2 extensively on language modeling, image classification, and the Long Range Arena benchmark. HGRN2 outperforms the original HGRN across all tasks, demonstrating the clear benefit of the state expansion mechanism. In large-scale language modeling experiments, their largest 3B-parameter HGRN2 model slightly outperforms Mamba and a LLaMA-architecture Transformer while using far fewer total training tokens.