This study constructed a new dataset consisting of research questions (RQs) extracted from research papers, paired with human annotations of their quality.
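As a rough illustration only (the field and aspect names below are hypothetical assumptions, not the paper's actual schema), each dataset entry could pair a paper with an extracted RQ and per-aspect human ratings:

```python
# Hypothetical record layout for an RQ-extraction evaluation dataset.
# Field names and aspect labels are illustrative; the paper's schema may differ.
from dataclasses import dataclass, field

@dataclass
class RQAnnotation:
    paper_id: str                                  # e.g. an arXiv identifier
    extracted_rq: str                              # research question extracted from the paper
    human_scores: dict[str, int] = field(default_factory=dict)  # aspect name -> human rating
```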
Using this dataset, the study measured how well the scores output by various LLM-based evaluation functions correlate with the human annotations. The results showed that existing LLM-based evaluation functions have only low to moderate correlation with human judgments, particularly for aspects other than identifying the method. This suggests the need to develop specialized evaluation functions tailored to the RQ extraction task in the research-paper domain, as existing functions designed for news summarization may not be sufficient.
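For concreteness, the kind of comparison described here, correlating automatic scores with human annotations, can be sketched as follows. The variable names and the choice of Pearson and Spearman coefficients are assumptions for illustration; the paper's exact protocol may differ.

```python
# Minimal sketch: correlate LLM-judge scores with human annotations for one aspect.
# Assumes two parallel score lists; not the paper's actual evaluation code.
from scipy.stats import pearsonr, spearmanr

def correlate(llm_scores: list[float], human_scores: list[float]) -> dict[str, float]:
    """Return Pearson and Spearman correlations between automatic and human scores."""
    pearson_r, _ = pearsonr(llm_scores, human_scores)
    spearman_rho, _ = spearmanr(llm_scores, human_scores)
    return {"pearson": pearson_r, "spearman": spearman_rho}

# Usage with placeholder scores:
# correlate([3.0, 4.5, 2.0, 4.0], [3, 5, 2, 3])
```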
The study also analyzed common patterns among RQs that were evaluated incorrectly, the impact of input/output token counts, the reproducibility of the methods, and strategies for improving performance. Key insights include the importance of modeling the evaluation procedure, the limited benefit of increasing the number of evaluation steps, and the tendency of some methods to overestimate scores.
Overall, this work provides a foundation for further research on developing better evaluation functions for RQ extraction, which is crucial for improving performance on this task and for deepening the understanding of research papers.
Source: Yuya Fujisak..., arxiv.org, 09-12-2024, https://arxiv.org/pdf/2409.06883.pdf