Contemporary Recurrent Language Models Match or Exceed Transformer Performance in Predicting Human Language Comprehension Metrics
Core Concepts
Contemporary recurrent language model architectures such as RWKV and Mamba can match or exceed the performance of transformer language models in predicting metrics of online human language comprehension, including neural measures like the N400 and behavioral measures like reading time.
Summary
This study compares the performance of transformer language models (Pythia) and two contemporary recurrent language model architectures (RWKV and Mamba) in predicting various metrics of online human language comprehension, including neural measures like the N400 and behavioral measures like reading time.
The key findings are:
- On the N400 datasets, the recurrent models (Mamba and RWKV) generally outperform the transformer models, especially when models of the same size are compared. This suggests that transformers are not uniquely well-suited to modeling the N400.
- On the reading time datasets, the results are more mixed: some show positive scaling (larger/better models perform better) and others show inverse scaling (larger/better models perform worse). This aligns with previous work on the complex relationship between language model performance and reading time metrics.
- When models are compared by perplexity rather than size alone, an interesting pattern emerges: on datasets showing positive scaling, Mamba (the architecture with the best perplexity) performs relatively worse than the other architectures, while RWKV (the architecture with the worst perplexity) performs relatively better. The opposite holds for datasets showing inverse scaling. This suggests that a language model's ability to predict the next word affects its ability to model human language comprehension beyond model size and architecture alone.
- The results highlight that no single universal pattern accounts for the relationship between language model probability and all metrics of online human language comprehension. The relationship is complex and depends on the specific dataset, metric, and model architecture.
Overall, the findings demonstrate that contemporary recurrent language models can match or exceed transformer performance in modeling human language comprehension, opening up new directions for research on the cognitive plausibility of different language model architectures.
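The perplexity comparison above can be made concrete with a minimal sketch. Perplexity is the exponentiated mean negative log-probability a model assigns to the tokens of a text; the per-token probabilities below are invented for illustration, not taken from the paper's models:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical per-token probabilities from two models on the same text
# (values invented for illustration).
probs_model_a = [0.2, 0.5, 0.1, 0.4]   # assigns higher probability overall
probs_model_b = [0.1, 0.3, 0.05, 0.2]  # assigns lower probability overall

# Lower perplexity means better next-word prediction, the quantity the
# study uses to rank architectures independently of parameter count.
assert perplexity(probs_model_a) < perplexity(probs_model_b)
```

A model that assigns its test text probability 0.5 per token on average has perplexity exactly 2; the study's point is that this ranking does not translate uniformly into better fit to comprehension metrics.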
From Source Content
Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics
Statistics
The surprisal values calculated from the language models were used to predict the following metrics of human language comprehension:
N400 amplitude (6 datasets)
Maze response time (1 dataset)
Self-paced reading response time (1 dataset)
Go-past duration (1 dataset)
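The surprisal values underlying all of these analyses are just negative log-probabilities of each word given its context. A minimal sketch, using a toy bigram model in place of a trained language model (all counts are invented for illustration):

```python
import math

# Toy bigram counts standing in for a trained language model
# (all counts are invented for illustration).
bigram_counts = {
    ("the", "cat"): 6, ("the", "dog"): 3, ("the", "axolotl"): 1,
}
context_totals = {"the": 10}

def surprisal(context, word):
    """Surprisal in bits: -log2 P(word | context)."""
    p = bigram_counts[(context, word)] / context_totals[context]
    return -math.log2(p)

# Less predictable continuations get higher surprisal; this is the
# quantity regressed against N400 amplitude and reading-time measures.
print(round(surprisal("the", "cat"), 3))      # common continuation: 0.737 bits
print(round(surprisal("the", "axolotl"), 3))  # rare continuation: 3.322 bits
```

In the study itself, these per-word surprisals come from the Pythia, RWKV, and Mamba models and are entered as predictors of the neural and behavioral metrics listed above.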
Quotes
"Transformers have supplanted Recurrent Neural Networks as the dominant architecture for both natural language processing tasks and, despite criticisms of cognitive implausibility, for modelling the effect of predictability on online human language comprehension."
"Nonetheless, the question of how recurrent and transformer language models compare as cognitive models of the human language system is still an open one."
"Transformers are therefore no longer the definitively best-performing language model architecture, and it is no longer the case that we should expect further advances in transformers to necessarily lead to improved fit to metrics of human language comprehension."
Deeper Inquiries
How do the architectural differences between transformers and recurrent models, such as the presence of a working memory bottleneck in recurrent models, impact their ability to capture specific aspects of human language processing like local interference effects or lexical priming?
The architectural differences between transformers and recurrent models play a significant role in their ability to capture specific aspects of human language processing. Recurrent models, with their inherent working memory bottleneck, are better suited to model phenomena like local interference effects. This is because the limited working memory capacity in recurrent models mimics the constraints of human working memory, allowing them to capture the effects of interference when processing language in real-time. On the other hand, transformers, with their finite context window and perfect access to all words within that window, may struggle to model local interference effects as effectively as recurrent models.
In terms of lexical priming, transformers may have an advantage due to their direct access to previous words, which can better capture priming effects where the processing of one word influences the processing of subsequent words. This feature of transformers allows them to leverage the statistical patterns in the data more efficiently, potentially making them more adept at modeling lexical priming effects compared to recurrent models.
Overall, the architectural differences between transformers and recurrent models influence their ability to capture specific aspects of human language processing. Recurrent models excel at modeling phenomena like local interference effects due to their working memory constraints, while transformers may have an edge in capturing lexical priming effects because of their access to a broader context of words.
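The contrast described above can be illustrated with a toy sketch: a recurrent model must squeeze its entire history into a fixed-size state, while a transformer keeps every token inside its context window directly accessible. The update rules below are deliberately simplified illustrations, not the real RWKV, Mamba, or transformer equations:

```python
# Toy contrast: fixed-size recurrent state vs. full-context access.
# (Simplified illustration, not either architecture's actual mechanics.)

def recurrent_read(tokens, state_size=2):
    """Compress the whole history into a fixed-size state (a bottleneck):
    here, naively, by keeping only the most recent `state_size` tokens."""
    state = []
    for tok in tokens:
        state = (state + [tok])[-state_size:]  # older tokens are forgotten
    return state

def transformer_read(tokens, window=8):
    """Direct access to every token inside the context window."""
    return tokens[-window:]

history = ["the", "lawyer", "that", "the", "banker", "irritated", "smiled"]

# The recurrent bottleneck has lost the early tokens, as human working
# memory might; the transformer can still "attend" to all of them.
assert "lawyer" not in recurrent_read(history)
assert "lawyer" in transformer_read(history)
```

This lossy compression is exactly why a bounded recurrent state is argued to be the more cognitively plausible substrate for interference effects, while unrestricted access to prior words favors transformers on priming-like phenomena.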