The paper proposes a framework called GeMQuAD to generate synthetic question-answer (Q&A) data in low-resource languages like Hindi and Spanish using few-shot learning on the AlexaTM 20B large language model. The key steps are:
Synthetic Data Generation: The authors use in-context learning (ICL) on AlexaTM 20B with just one annotated example in the target language to generate synthetic Q&A pairs for contexts from the XTREME dataset.
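The 1-shot generation step can be sketched as simple prompt assembly: one annotated demonstration in the target language, followed by the unannotated target context. The template below is a hypothetical illustration; the exact prompt format used with AlexaTM 20B may differ.

```python
def build_icl_prompt(demo_context, demo_question, demo_answer, target_context):
    """Assemble a 1-shot in-context-learning prompt (hypothetical template).

    The model is expected to continue the pattern and emit a question
    (and then an answer) for the target context."""
    return (
        f"Context: {demo_context}\n"
        f"Question: {demo_question}\n"
        f"Answer: {demo_answer}\n\n"
        f"Context: {target_context}\n"
        f"Question:"
    )

# One Spanish demonstration, then a new context to annotate.
prompt = build_icl_prompt(
    "El Amazonas es el río más caudaloso del mundo.",
    "¿Qué río es el más caudaloso del mundo?",
    "El Amazonas",
    "Madrid es la capital de España.",
)
```

The prompt ends at "Question:" so that decoding naturally produces a synthetic question-answer pair for the new context.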
Semi-Supervised Data Filtering: The authors apply a semi-supervised learning approach based on WeakDAP to identify high-quality synthetic Q&A pairs. An initial student model (XLM-R-Base) trained on English data is used as a weak labeler to filter the synthetic data. The student model is then iteratively fine-tuned on the filtered "silver" data along with the English "gold" data.
Student Model Fine-Tuning: The student model is fine-tuned first on the filtered silver data and then on the gold English data, so that the higher-quality gold data has the final influence on the model.
The authors evaluate the performance of the student model on the MLQA and XQUAD datasets for Hindi and Spanish. Their approach outperforms the baseline model trained only on English data by 5.05/6.50 F1/EM points for Hindi and 3.81/3.69 points for Spanish on MLQA. It also surpasses the performance of a model trained on machine-translated data by 0.22/1.68 F1/EM for Hindi and 0.82/1.37 F1/EM for Spanish on MLQA.
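The F1/EM gains above refer to the standard extractive-QA metrics. A simplified sketch of how they are computed per prediction (SQuAD-style evaluation additionally normalizes punctuation and articles, which is omitted here):

```python
from collections import Counter

def exact_match(pred, gold):
    """EM: 1 if the normalized strings are identical, else 0 (simplified)."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-overlap F1, the softer of the two extractive-QA metrics."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# A partially correct span scores 0 on EM but still earns partial F1 credit.
em = exact_match("the Amazon river", "Amazon river")   # 0
f1 = token_f1("the Amazon river", "Amazon river")      # 0.8
```

Corpus-level scores are the averages of these per-example values, which is what the reported point differences compare.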
The authors also observe improvements in languages not included in the fine-tuning data, such as German, Arabic, Vietnamese, and Chinese, demonstrating the enhanced cross-lingual transfer capability of their approach.