Core Concepts
This survey examines past research on adapting natural language processing (NLP) techniques to handle dialects of a language, with the goal of building more inclusive and equitable language technologies.
Abstract
This survey provides a comprehensive overview of past research on natural language processing (NLP) for dialects of a language. It covers a wide range of languages, including English, Arabic, Chinese, German, and Indic languages, among others.
The survey begins by motivating the need for dialect-aware NLP, highlighting the linguistic challenges posed by dialectal variation, the importance of rethinking benchmark datasets for large language models, and the implications for building fair and equitable language technologies. It then outlines the scope and key trends in this research area.
The survey delves into the resources available for dialects, including dialectal lexicons and datasets. It then covers NLP tasks in two broad categories: natural language understanding (NLU) and natural language generation (NLG). For NLU, the survey discusses approaches for dialect identification, sentiment analysis, parsing, and NLU benchmarks. For NLG, it covers machine translation, summarization, and conversational AI.
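To make the dialect identification task concrete, here is a minimal, self-contained sketch in the spirit of classic character n-gram profiling. The dialect labels, training sentences, and the overlap scoring are toy assumptions for illustration, not a method from the survey:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, a common surface feature for dialect ID."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def profile(examples, n=3):
    """Aggregate n-gram profile for one dialect from its training examples."""
    total = Counter()
    for ex in examples:
        total += char_ngrams(ex, n)
    return total

def identify(text, profiles, n=3):
    """Assign the dialect whose profile overlaps most with the input's n-grams."""
    feats = char_ngrams(text, n)
    def overlap(p):
        return sum(min(count, p[gram]) for gram, count in feats.items())
    return max(profiles, key=lambda d: overlap(profiles[d]))

# Toy training data (hypothetical examples, illustration only).
profiles = {
    "en-US": profile(["the color of the neighbor's car",
                      "I realized the center was closed"]),
    "en-GB": profile(["the colour of the neighbour's car",
                      "I realised the centre was closed"]),
}
print(identify("she apologised for the colour", profiles))  # → en-GB
```

Real systems replace the raw overlap score with a trained classifier over such features, but the core framing of dialect identification as text classification over surface cues is the same.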
The survey highlights that past work in NLP for dialects goes beyond mere dialect classification, with recent approaches integrating dialect-awareness into model architectures using techniques like adversarial networks and hypernetworks. It also notes the growing trend of incorporating dialectal aspects to address social and cultural factors in language technologies.
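The hypernetwork idea mentioned above can be sketched in a few lines: instead of training one model per dialect or sharing one set of weights across all dialects, a small generator network produces the task layer's weights from a dialect embedding. The class name, dimensions, and embeddings below are illustrative assumptions, not an architecture from the survey:

```python
import random

def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class HyperLinear:
    """A linear task layer whose weights are *generated* from a dialect
    embedding by a small hypernetwork (illustrative sketch, untrained)."""
    def __init__(self, dialect_dim, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Hypernetwork parameters: one generator row per task-layer weight.
        self.gen = [[rng.uniform(-0.5, 0.5) for _ in range(dialect_dim)]
                    for _ in range(in_dim * out_dim)]
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, x, dialect_emb):
        flat = matvec(self.gen, dialect_emb)           # generate task weights
        W = [flat[i * self.in_dim:(i + 1) * self.in_dim]
             for i in range(self.out_dim)]             # reshape to (out, in)
        return matvec(W, x)

layer = HyperLinear(dialect_dim=2, in_dim=3, out_dim=2)
x = [1.0, 0.5, -0.2]
out_a = layer.forward(x, [1.0, 0.0])  # toy embedding for dialect A
out_b = layer.forward(x, [0.0, 1.0])  # toy embedding for dialect B
print(out_a != out_b)  # same input, dialect-conditioned outputs differ
```

In practice the dialect embeddings and the generator are learned jointly with the task, so dialects share parameters through the hypernetwork while still receiving dialect-specific behavior; adversarial approaches take the complementary route of removing dialect information from shared representations.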
The survey concludes by discussing future directions and the social and ethical implications of this research, emphasizing the importance of linguistic and cultural inclusion in the development of NLP systems.