Giulianelli, M., Malagutti, L., Gastaldi, J. L., DuSell, B., Vieira, T., & Cotterell, R. (2024). On the Proper Treatment of Tokenization in Psycholinguistics. arXiv preprint arXiv:2410.02691.
This paper addresses the misalignment between token-level language models and the character-level nature of psycholinguistic stimuli. The authors argue that character-level surprisal, computed over carefully chosen focal areas, is a more accurate predictor of processing difficulty in reading studies than token-level surprisal.
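In symbols (a hedged reconstruction; the notation below is mine, not lifted from the paper), the surprisal of a focal character span c_{i..j} given its preceding characters is:

```latex
s(c_{i..j}) = -\log p(c_{i..j} \mid c_{<i})
            = \log p(c_{<i}) - \log p(c_{\le j}),
\qquad
p(c) = \sum_{t \,:\, c \preceq \kappa(t)} p_{\mathrm{LM}}(t)
```

where \kappa maps a token string to its character yield, and the sum, taken over minimal covering token strings to avoid double counting, marginalizes the token-level model over all tokenizations consistent with the character prefix.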
The authors propose marginalizing token-level language models into character-level models to compute surprisal for arbitrary character substrings. They test the effectiveness of different focal areas, including fixed-size, dynamically sized, and look-ahead areas, on four eye-tracking datasets (UCL, Provo, MECO, and CELER) by comparing their predictive power for skip rate and reading times (first fixation, gaze, and total duration).
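As a concrete, deliberately naive illustration of that marginalization, the sketch below brute-forces the character-prefix probability under a toy token-level model. The vocabulary and scoring function are invented stand-ins, and the paper's actual algorithm is far more efficient than this enumeration:

```python
import math

# Toy token vocabulary; each token is a string of characters.
VOCAB = ["th", "e", "the", "re", "a", "d", "read"]

def token_seq_logprob(tokens: tuple) -> float:
    # Stand-in for the token-level LM's prefix log-probability log p_LM(t).
    # Any autoregressive LM would slot in here; this toy (unnormalized)
    # version simply prefers shorter token sequences.
    return -1.5 * len(tokens)

def char_prefix_logprob(chars: str) -> float:
    """Log-probability that `chars` is a character prefix of the model's
    output, computed by summing over all minimal token sequences whose
    concatenated yield starts with `chars` (brute-force marginalization)."""
    total = 0.0
    stack = [((), "")]  # (token sequence so far, characters yielded so far)
    while stack:
        tokens, yielded = stack.pop()
        if len(yielded) >= len(chars):
            total += math.exp(token_seq_logprob(tokens))  # minimal cover found
            continue
        for tok in VOCAB:
            grown = yielded + tok
            # Keep only continuations consistent with the target prefix.
            if chars.startswith(grown) or grown.startswith(chars):
                stack.append((tokens + (tok,), grown))
    return math.log(total) if total > 0 else -math.inf

def char_surprisal(context: str, focal: str) -> float:
    """Surprisal (nats) of the focal span given the preceding characters:
    -log p(focal | context) = log p(context) - log p(context + focal)."""
    return char_prefix_logprob(context) - char_prefix_logprob(context + focal)

print(char_surprisal("the", "read"))  # e.g., surprisal of "read" after "the"
```

Because surprisal is defined over character spans rather than tokens, the same machinery can score any focal area, whether or not its boundaries coincide with token boundaries.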
The choice of focal area significantly impacts the predictive power of surprisal in modeling reading behavior. Character-level surprisal, calculated over focal areas informed by psycholinguistic insights, provides a more nuanced and accurate representation of processing difficulty than traditional approaches relying on full-region surprisal.
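To make the comparison concrete, here is a hypothetical illustration of the three focal-area families named above; the exact definitions below are assumptions for demonstration, not the paper's specifications:

```python
def fixed_size_area(region: str, k: int = 4) -> str:
    """Fixed-size: the first k characters of the word region."""
    return region[:k]

def dynamic_area(region: str) -> str:
    """Dynamically sized: the whole region minus trailing whitespace."""
    return region.rstrip()

def look_ahead_area(region: str, following: str, k: int = 2) -> str:
    """Look-ahead: the region plus the first k characters of what follows."""
    return region + following[:k]
```

Each variant selects a different character span; feeding that span to a character-level surprisal function (such as char_surprisal in the sketch above) yields a different predictor, which is why the choice of focal area changes the fit to reading-time data.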
This research highlights the importance of aligning computational methods with the character-level nature of psycholinguistic phenomena. It provides a framework for researchers to leverage the power of modern language models while respecting the inherent properties of human language processing.
The study focuses on English eye-tracking data and does not address the complexities of self-paced reading paradigms. Future research should explore the applicability of focal-area predictors to other languages, reading paradigms, and psycholinguistic measurements. Additionally, investigating non-linear relationships between surprisal and reading behavior and accounting for individual differences are promising avenues for further exploration.