The paper introduces the WARRI dataset, which contains parallel sentences in English and the two written genres of Nigerian Pidgin (BBC and Wikipedia). Through statistical analysis and machine translation experiments, the authors show that the BBC and Wikipedia genres differ significantly in word order, vocabulary, and other linguistic features.
The BBC genre is closer to English and favored by more educated Nigerians, while the Wikipedia genre is closer to the spoken form of Nigerian Pidgin and more accessible to a wider range of speakers. The authors find that current Generative AI models, such as GPT-4-TURBO and LLAMA 2 13B, are biased towards the BBC genre and do not adequately represent the Wikipedia genre. This bias can lead to exclusion of certain communities of Nigerian Pidgin speakers.
The paper highlights the importance of ensuring that the data used to train Generative AI systems represents linguistic variation, especially in multilingual settings and for low-resource languages like Nigerian Pidgin. The authors recommend integrating the different genres of a language into these systems separately to improve inclusivity and accessibility for diverse user groups.
Source: arxiv.org
Deeper Questions