Core Concepts
Existing LLM-based evaluation functions do not correlate well with human judgments in assessing the quality of research questions extracted from academic papers, suggesting the need for developing specialized evaluation functions for this task.
Abstract
This study constructed a new dataset consisting of:
1. Abstracts and introductions of 104 machine learning research papers accepted at ACL from 2016 to 2023.
2. Research questions (RQ) extracted from these abstracts and introductions using GPT-4 with three different prompts (a sketch of this extraction step follows the list).
3. Human annotations evaluating the quality of the extracted RQ from three perspectives: accurately capturing the problem, accurately capturing the method, and conforming to the expected RQ format.
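The extraction step in item 2 can be pictured with a minimal sketch using the OpenAI Python client. The prompt wordings below are illustrative placeholders, not the three prompts actually used in the study, and the function name is hypothetical.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative placeholders -- the study's actual three prompts are not reproduced here.
PROMPTS = [
    "State the research question addressed by this paper in one sentence.",
    "Extract the research question, covering both the problem and the method.",
    "What research question does this paper answer? "
    "Respond in the form 'How to <solve problem> by <method>?'",
]

def extract_rqs(abstract_and_intro: str, model: str = "gpt-4") -> list[str]:
    """Extract one candidate RQ per prompt from a paper's abstract + introduction."""
    rqs = []
    for prompt in PROMPTS:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": abstract_and_intro},
            ],
            temperature=0,
        )
        rqs.append(response.choices[0].message.content.strip())
    return rqs
```

Running this over each paper's abstract and introduction would yield three candidate RQs per paper, which is the kind of output the human annotators then judged on the three aspects above.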
Using this dataset, the study compared various LLM-based evaluation functions by measuring how well their output scores correlate with the human annotations. The results showed that the existing LLM-based evaluation functions correlate only weakly to moderately with human judgments, particularly for aspects other than identifying the method. This suggests the need to develop specialized evaluation functions tailored to RQ extraction in the research paper domain, as existing functions designed for news summarization may not be sufficient.
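A minimal sketch of this kind of correlation comparison, assuming per-RQ scores from an evaluation function and per-RQ human annotation scores are available as parallel lists; the function name and the example data are illustrative, not taken from the paper.

```python
from scipy.stats import pearsonr, spearmanr

def correlation_with_humans(eval_scores: list[float], human_scores: list[float]) -> dict:
    """Compare an evaluation function's scores against human annotations.

    The two lists are parallel: one entry per extracted RQ, for a single
    evaluation aspect (problem, method, or format).
    """
    pearson, _ = pearsonr(eval_scores, human_scores)
    spearman, _ = spearmanr(eval_scores, human_scores)
    return {"pearson": pearson, "spearman": spearman}

# Hypothetical example: scores for five extracted RQs on one aspect.
llm_eval = [0.9, 0.4, 0.7, 0.8, 0.3]
human = [1.0, 0.0, 1.0, 1.0, 0.0]
print(correlation_with_humans(llm_eval, human))
```

Repeating this per evaluation function and per aspect gives the kind of comparison table the study uses to conclude that correlations are low to moderate outside the method aspect.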
The study also analyzed common patterns among the RQs that were evaluated incorrectly, the impact of input/output token counts, the reproducibility of the methods, and strategies for improving performance. Key insights include the importance of modeling the evaluation procedure, the limited impact of increasing the number of evaluation steps, and the tendency of some methods to overestimate scores.
Overall, this work provides a foundation for further research on developing better evaluation functions for RQ extraction, which is crucial for improving performance on this task and for deepening the understanding of research papers.
Stats
The research papers used in this study were accepted at ACL from 2016 to 2023.
The average length of the paper abstracts and introductions was 250 tokens.
Quotes
"Existing LLM-based evaluation functions do not correlate well with human judgments in assessing the quality of research questions extracted from academic papers, suggesting the need for developing specialized evaluation functions for this task."
"This study constructed a new dataset consisting of: 1. Abstracts and introductions of 104 machine learning research papers accepted at ACL from 2016 to 2023. 2. Research questions (RQ) extracted from these abstracts and introductions using GPT-4 with three different prompts. 3. Human annotations evaluating the quality of the extracted RQ from three perspectives: accurately capturing the problem, accurately capturing the method, and conforming to the expected RQ format."