Core Concept
Neural models can guide linguists during fieldwork by optimizing the data collection process and accounting for the dynamics of linguist-speaker interactions.
Summary
This paper presents a novel approach to leveraging neural models to enhance the efficiency of linguistic fieldwork, with a focus on the collection of morphological data.
The key highlights and insights are:
The authors introduce a framework that evaluates the effectiveness of various sampling strategies for obtaining morphological data and assesses the ability of state-of-the-art neural models to generalize morphological structures.
The experiments highlight two key strategies for improving the efficiency of the data collection process:
Increasing the diversity of annotated data by uniform sampling among the cells of the paradigm tables.
Using model confidence as a guide, so that the predictions offered to the speaker during annotation are reliable and the interaction stays positive.
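The second strategy can be illustrated with a minimal sketch. It assumes, purely for illustration, that the model exposes a probability for each character of a predicted form and that confidence is the product of those probabilities; the function names, the 0.8 threshold, and the toy predictions are hypothetical and not taken from the paper.

```python
import math

def sequence_confidence(char_probs):
    """Confidence of a predicted form: product of its per-character probabilities.

    Assumption: the model returns one probability (> 0) per generated character.
    """
    return math.exp(sum(math.log(p) for p in char_probs))

def split_by_confidence(predictions, threshold=0.8):
    """Split predictions into forms to suggest to the speaker and forms to elicit.

    `predictions` holds (cell, predicted_form, char_probs) triples.
    The threshold is an illustrative value, not one reported by the authors.
    """
    suggest, elicit = [], []
    for cell, form, char_probs in predictions:
        conf = sequence_confidence(char_probs)
        if conf >= threshold:
            suggest.append((cell, form, conf))  # reliable enough to show
        else:
            elicit.append((cell, form, conf))   # ask the speaker directly
    return suggest, elicit

# Toy predictions; the probabilities are made up for the example.
predictions = [
    (("walk", "V;PST"), "walked", [0.99, 0.98, 0.99, 0.97, 0.99, 0.98]),
    (("go", "V;PST"), "goed", [0.9, 0.8, 0.5, 0.6]),
]
suggest, elicit = split_by_confidence(predictions)
```

Here the high-confidence form would be shown to the speaker for confirmation, while the low-confidence one would be elicited from scratch, which avoids presenting the speaker with an unreliable guess.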
The results show that uniform random sampling across paradigm cells leads to more representative data and better generalization, outperforming strategies that prioritize the completion of full paradigms or focus on the most confident predictions.
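The contrast between the two sampling regimes can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the paradigm table is modeled as every (lemma, slot) pair, and the lemma list, slot labels, and annotation budget are hypothetical.

```python
import random

def sample_uniform_cells(lemmas, slots, budget, rng):
    """Draw `budget` cells uniformly at random, without replacement,
    from all (lemma, slot) pairs of the paradigm table."""
    cells = [(lemma, slot) for lemma in lemmas for slot in slots]
    return rng.sample(cells, budget)

def sample_full_paradigms(lemmas, slots, budget, rng):
    """Fill in one randomly ordered paradigm after another
    until the annotation budget is spent."""
    order = rng.sample(lemmas, len(lemmas))  # shuffled copy of the lemmas
    cells = [(lemma, slot) for lemma in order for slot in slots]
    return cells[:budget]

# Hypothetical toy table: 6 lemmas x 4 slots, budget of 8 annotations.
lemmas = ["walk", "talk", "jump", "sing", "go", "eat"]
slots = ["NFIN", "3SG", "PST", "PTCP"]
rng = random.Random(0)

uniform = sample_uniform_cells(lemmas, slots, 8, rng)
by_paradigm = sample_full_paradigms(lemmas, slots, 8, rng)

# Number of distinct lemmas each strategy touches with the same budget.
print(len({lemma for lemma, _ in uniform}),
      len({lemma for lemma, _ in by_paradigm}))
```

With the same budget, paradigm-by-paradigm collection covers exactly two lemmas in this toy setup, whereas uniform cell sampling typically spreads the annotations over more lemmas, which is the diversity effect the summary describes.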
The authors also introduce a new metric, the Normalized Efficiency Score, to better capture the efficiency of the elicitation process by considering the number of interactions with the speaker and the accuracy of the final model.
The study examines a range of typologically diverse languages, providing insights into the effectiveness of the proposed approach across different morphological systems and data availability conditions.
Overall, this work demonstrates how neural models can be leveraged to guide linguists during fieldwork, making the process of data collection more efficient and informative.
Statistics
The total numbers of wordforms and lemmas and the average paradigm size (APS) for the selected part of speech in each examined language are as follows:
Language          Wordforms    Lemmas    APS
English               5,120     1,280      4
Latin               240,078     5,185     89
Russian             208,198    18,008     16
Central Kurdish      21,375       375     57
Turkish              80,264       380    295
Mongolian            14,396     2,057      8
Central Pame         12,528       216     58
Murrinh-patha         1,110        30     37