Core Concepts
Griffin introduces a hybrid model combining gated linear recurrences with local attention, matching or exceeding strong Transformer baselines in quality while improving training and inference efficiency.
Abstract
Griffin presents a novel approach to language modeling, combining gated linear recurrences with local attention to improve both performance and efficiency. The study compares Griffin to Transformer baselines, highlighting its ability to scale efficiently on long sequences and achieve high throughput during inference. Griffin also shows promise on tasks requiring copying and retrieval, outperforming Transformer baselines in certain scenarios.
Recurrent neural networks have played a significant role in deep learning and NLP research but face challenges in training and scalability. The study proposes Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model mixing gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches Llama-2 despite being trained on far fewer tokens.
The study delves into the architecture of Hawk and Griffin, detailing their components such as residual blocks, MLP blocks, and temporal-mixing blocks. It highlights the power law scaling exhibited by both models between held-out loss and training FLOPs across different model scales.
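The power-law relationship between held-out loss and training FLOPs mentioned above can be illustrated with a small fit in log-log space. The data points and exponent below are entirely synthetic, chosen only to show the mechanics; they are not values from the paper.

```python
import numpy as np

# Assume held-out loss follows L(C) = A * C**(-b) for training compute C
# (in FLOPs). These points are made up and lie exactly on the law.
flops = np.array([1e18, 1e19, 1e20, 1e21])
loss = 20.0 * flops ** (-0.05)

# A power law is linear in log-log space: log L = log A - b * log C,
# so an ordinary least-squares fit recovers the exponent b and scale A.
slope, intercept = np.polyfit(np.log(flops), np.log(loss), 1)
b, A = -slope, np.exp(intercept)
```

On real training runs the points would scatter around the line, and the fitted `b` summarizes how quickly loss falls with additional compute.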
Griffin's Real-Gated Linear Recurrent Unit (RG-LRU) layer is introduced as a novel recurrent layer inspired by previous works on non-linear RNNs. The study emphasizes the efficient implementation of linear recurrences on devices like TPUs for optimal training speed.
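To make the recurrence concrete, here is a minimal NumPy sketch of one step of a real-gated linear recurrence in the spirit of the RG-LRU. The weight names (`W_r`, `W_i`, `lam`) and the constant `c = 8.0` are assumptions for illustration; the actual layer's parameterization and initialization follow the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rg_lru_step(h, x, W_r, W_i, lam, c=8.0):
    """One step of a real-gated linear recurrence (illustrative sketch).

    h: hidden state, shape (d,); x: input, shape (d,);
    W_r, W_i: (d, d) gate projections; lam: (d,) decay parameter.
    """
    r = sigmoid(W_r @ x)  # recurrence gate, in (0, 1)
    i = sigmoid(W_i @ x)  # input gate, in (0, 1)
    # Decay a = sigmoid(lam) ** (c * r), computed stably in log space:
    # log sigmoid(lam) = -softplus(-lam) = -logaddexp(0, -lam)
    a = np.exp(-c * r * np.logaddexp(0.0, -lam))
    # Variance-preserving update: the sqrt(1 - a^2) factor keeps the
    # state's scale stable regardless of how strongly it decays.
    return a * h + np.sqrt(1.0 - a**2) * (i * x)
```

Because the update is linear in `h`, the whole sequence can be computed with a parallel scan rather than a step-by-step loop, which is what makes the efficient TPU implementation noted above possible.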
In terms of inference speed, Griffin showcases lower latency and higher throughput during decoding stages compared to traditional Transformers. The study also explores the models' capabilities in long context modeling and tasks requiring copying and retrieval abilities.
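Part of the inference advantage comes from local attention: each position attends only to a fixed-size window of recent positions, so the KV cache stays bounded as the sequence grows. A minimal sketch of the corresponding causal sliding-window mask (the window size here is arbitrary, not the paper's setting):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean (seq_len, seq_len) mask for causal local attention.

    Entry [t, s] is True iff query position t may attend to key
    position s, i.e. s <= t and t - s < window.
    """
    idx = np.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]          # no attending to the future
    local = idx[:, None] - idx[None, :] < window   # only the last `window` keys
    return causal & local
```

With this mask, memory and per-step compute during decoding depend on the window size rather than the full sequence length, which is why throughput stays high on long sequences.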
Stats
Hawk exceeds Mamba's performance on downstream tasks despite being trained on fewer tokens.
Griffin matches Llama-2's performance despite being trained on significantly fewer tokens.
Both Hawk and Griffin achieve comparable training efficiency to Transformers on TPU-v3.
During inference, Hawk and Griffin achieve significantly higher throughput than MQA Transformers.
Griffin extrapolates better than Transformers when evaluated on sequences longer than those seen during training.
Quotes
"Both Hawk and Griffin exhibit power law scaling between held-out loss and training FLOPs."
"Griffin achieves slightly lower held-out loss than strong Transformer baselines at all model scales."
"Hawk exceeds the reported performance of Mamba on downstream tasks."
"Griffin can extrapolate on sequences significantly longer than those seen during training."