Ko Freitag, R. M., & de Gois, T. S. (2024). Performance in a dialectal profiling task of LLMs for varieties of Brazilian Portuguese. arXiv preprint arXiv:2410.10991v1.
This research paper investigates the ability of four LLMs (GPT-3.5, GPT-4o, Gemini, and Sabiá-2) to identify dialectal variation in Brazilian Portuguese, and assesses their potential for dialectal bias.
The researchers designed a three-stage study:
1) Target-profile generation: LLMs generated text passages simulating typical linguistic profiles for each of Brazil's 27 states.
2) Target-profile classification: LLMs identified the state of origin for the generated texts using two prompt types, one with only the text and another with additional linguistic clues.
3) Data wrangling: the classification data were cleaned, standardized, and structured for analysis.
Three human experts also classified the texts for comparison. Agreement was measured using Fleiss’ Kappa.
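Fleiss' Kappa, the agreement measure named above, compares the observed agreement among a fixed number of raters with the agreement expected by chance. The following is a minimal sketch of the standard formula, not the authors' code. The function name `fleiss_kappa`, the input layout (one list of labels per text, one label per rater), and the state abbreviations in the example are illustrative assumptions, not data from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of items, each a list of labels (one per rater).

    Assumes every item was labeled by the same number of raters (at least two).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()  # how often each label was assigned overall
    agreement_sum = 0.0          # running sum of per-item agreement P_i

    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # P_i: proportion of rater pairs that agree on this item
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items  # mean observed agreement
    total = n_items * n_raters
    # P_e: agreement expected by chance, from the overall label proportions
    p_e = sum((c / total) ** 2 for c in category_totals.values())

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical example: three raters assign a state label to each of four texts.
example = [
    ["SP", "SP", "SP"],
    ["BA", "BA", "SP"],
    ["RS", "RS", "RS"],
    ["BA", "SP", "RS"],
]
print(round(fleiss_kappa(example), 3))
```

A value of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate agreement below chance.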
While LLMs can detect dialectal variation, their ability to accurately identify specific regional dialects in Brazilian Portuguese remains limited and inconsistent. This inconsistency points to potential biases in training data and underscores the need for further research to ensure linguistic justice in AI.
This study contributes to the growing field of sociolinguistic analysis of LLMs, highlighting the importance of addressing dialectal biases to ensure fairness and inclusivity in NLP applications.
The study acknowledges limitations in the number of human evaluators and suggests further research with larger and more diverse datasets to improve the accuracy and fairness of dialectal processing in LLMs.