Core Concepts
Large language models can automatically generate and evaluate multiple-choice reading comprehension test items of acceptable quality, even for languages with limited available data.
Abstract
The paper explores the use of large language models (LLMs) for automatically generating and evaluating multiple-choice reading comprehension (MCRC) test items in the German language. The authors compiled a dataset of German texts and MCRC items from online language courses, and developed a new evaluation protocol and metric called "text informativity" to assess the quality of generated items.
The key highlights and insights from the paper are:
Zero-shot generation with state-of-the-art LLMs like GPT-4 and Llama 2 can produce MCRC items of acceptable quality, even for languages with limited available data.
The proposed evaluation protocol measures the "answerability" and "guessability" of MCRC items by having human annotators or LLMs respond to the items both with and without access to the corresponding text. The difference between these two scores, called "text informativity", indicates how well an item actually tests reading comprehension rather than prior knowledge or guessing.
Applying the evaluation protocol, the authors found that items generated by GPT-4 outperformed those generated by Llama 2 in terms of text informativity and human quality ratings.
Using GPT-4 as an automatic evaluator produced results most similar to human annotators, suggesting that LLMs can be a viable alternative to human evaluation for this task.
Qualitative analysis revealed that the main challenges in generating high-quality MCRC items are avoiding easily guessable questions and ensuring that the items are unambiguously answerable based on the given text.
Overall, the paper demonstrates the potential of using LLMs for zero-shot generation and automatic evaluation of MCRC items, which can be particularly useful for languages with limited assessment resources.
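The text informativity metric described above can be sketched in a few lines of code. This is a simplified illustration, not the authors' implementation: it assumes single-answer items (the paper notes that many items allow multiple correct options) and all variable names and example data are hypothetical.

```python
def accuracy(responses, gold):
    """Fraction of responses that match the gold answers."""
    return sum(r == g for r, g in zip(responses, gold)) / len(gold)

def text_informativity(responses_with_text, responses_without_text, gold):
    """Text informativity = answerability - guessability.

    answerability: accuracy when the annotator (or LLM) sees the text.
    guessability:  accuracy when the text is withheld.
    A high value means the item genuinely tests reading comprehension;
    a value near zero means the item can be answered without the text.
    """
    answerability = accuracy(responses_with_text, gold)
    guessability = accuracy(responses_without_text, gold)
    return answerability - guessability

# Hypothetical example: four items, one gold answer each.
gold = ["B", "A", "C", "D"]
with_text = ["B", "A", "C", "A"]     # responses given the passage: 3/4 correct
without_text = ["B", "C", "D", "A"]  # blind guesses: 1/4 correct
print(text_informativity(with_text, without_text, gold))  # 0.75 - 0.25 = 0.5
```

In this sketch, an informativity of 0.5 suggests the items are mostly text-dependent, while a value near zero would flag easily guessable items, which the qualitative analysis identifies as a main failure mode.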
Stats
The average text length in the dataset is 327 tokens.
66% of the MCRC items allow multiple correct answer options.
Quotes
"Reading comprehension tests are used in a variety of applications, reaching from education to assessing the comprehensibility of simplified texts."
"Given the recent advancements in the zero-shot capabilities of LLMs, automatically generating MCRC items appears to be a promising option."
"Text informativity is the difference between answerability and guessability and denotes to what degree the text informs the item responses."