Griffin presents a novel approach to language modeling that combines gated linear recurrences with local attention, demonstrating strong performance and efficiency. The study compares Griffin to Transformer baselines, highlighting its ability to scale efficiently to long sequences and to achieve high throughput during inference. Griffin also shows promise on tasks requiring copying and retrieval, and the study reports that it can extrapolate to sequences longer than those seen during training.
Recurrent neural networks have played a significant role in deep learning and NLP research, but they remain difficult to train and scale. The study proposes Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported downstream performance of Mamba, while Griffin matches the performance of Llama-2 despite being trained on substantially fewer tokens.
The study details the architecture of Hawk and Griffin, describing components such as residual blocks, MLP blocks, and temporal-mixing blocks. It highlights that both models exhibit power-law scaling between held-out loss and training FLOPs across a range of model scales.
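As a rough guide to how these pieces fit together, here is a minimal sketch of one residual block, assuming a pre-norm layout in which a temporal-mixing layer and a gated MLP are each wrapped in a residual connection; the function names and signatures are illustrative, not taken from the authors' code.

```python
def residual_block(x, temporal_mixing, mlp, norm1, norm2):
    """Sketch of a pre-norm residual block in the spirit of Hawk and Griffin.

    `temporal_mixing` is either a recurrent (RG-LRU-based) block or a
    local-attention block; `mlp` is the gated feed-forward block.
    """
    x = x + temporal_mixing(norm1(x))  # temporal-mixing block with residual connection
    x = x + mlp(norm2(x))              # MLP block with residual connection
    return x
```

Hawk uses recurrent temporal-mixing blocks throughout, while Griffin interleaves recurrent blocks with local-attention blocks.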
Griffin's Real-Gated Linear Recurrent Unit (RG-LRU) layer is introduced as a novel gated recurrent layer inspired by the Linear Recurrent Unit (LRU), with gating mechanisms motivated by the literature on non-linear RNNs such as LSTMs and GRUs. The study also emphasizes implementing the linear recurrence efficiently on devices like TPUs, where the operation is memory-bound, to keep training speed competitive.
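As an illustration of the recurrence itself, the following is a minimal JAX sketch of a diagonal gated linear recurrence in the spirit of the RG-LRU; the parameter names, shapes, the constant `c`, and the use of `jax.lax.scan` are assumptions made for readability, not the authors' TPU implementation.

```python
import jax
import jax.numpy as jnp

def rg_lru_sketch(x, W_a, b_a, W_x, b_x, lam, c=8.0):
    """Minimal sketch of a gated linear recurrence in the spirit of the RG-LRU.

    x: (seq_len, dim) input sequence for a single example.
    lam: (dim,) learnable parameter controlling the per-channel decay.
    """
    a_base = jax.nn.sigmoid(lam)  # per-channel decay, constrained to (0, 1)

    def step(h_prev, x_t):
        r_t = jax.nn.sigmoid(x_t @ W_a + b_a)   # recurrence gate
        i_t = jax.nn.sigmoid(x_t @ W_x + b_x)   # input gate
        a_t = a_base ** (c * r_t)               # gated decay, still in (0, 1)
        h_t = a_t * h_prev + jnp.sqrt(1.0 - a_t ** 2) * (i_t * x_t)
        return h_t, h_t

    h0 = jnp.zeros(x.shape[-1])
    _, hs = jax.lax.scan(step, h0, x)
    return hs  # (seq_len, dim) hidden states
```

The `sqrt(1 - a_t**2)` factor scales the gated input so that the hidden state stays bounded regardless of how strong the decay is.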
During inference, Griffin achieves lower latency and significantly higher throughput when decoding long sequences than comparable Transformers, because its recurrent state and local-attention cache stay fixed in size rather than growing with the sequence. The study also explores long-context modeling and tasks that require copying and retrieval abilities.
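To make the throughput argument concrete, here is a minimal sketch of a single decode step with sliding-window (local) attention, where the key-value cache is bounded by the window size; the window value, shapes, and function name are illustrative assumptions, not details from the paper's code.

```python
import jax
import jax.numpy as jnp

def local_attention_decode_step(q_t, k_t, v_t, k_cache, v_cache, window=1024):
    """Sketch of one decode step with sliding-window (local) attention.

    The cache is truncated to the most recent `window` tokens, so memory use
    and per-token compute stay constant as the generated sequence grows.
    q_t, k_t, v_t: (head_dim,) projections for the current token.
    k_cache, v_cache: (<=window, head_dim) cached keys and values.
    """
    k_cache = jnp.concatenate([k_cache, k_t[None, :]], axis=0)[-window:]
    v_cache = jnp.concatenate([v_cache, v_t[None, :]], axis=0)[-window:]
    scores = k_cache @ q_t / jnp.sqrt(q_t.shape[-1])  # scaled dot-product scores
    weights = jax.nn.softmax(scores)
    out = weights @ v_cache
    return out, k_cache, v_cache
```

Bounding the cache this way keeps per-token cost and memory flat as generation gets longer, in contrast to global attention, whose cache grows linearly with the sequence.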
Source: Soham De et al., arXiv, 2024-03-01. https://arxiv.org/pdf/2402.19427.pdf