The authors present FABLES, a dataset of human annotations on the faithfulness and content selection of LLM-generated summaries of 26 recently published fictional books. They hired annotators who had fully read each book prior to the annotation task to mitigate the issue of data contamination.
The study finds that CLAUDE-3-OPUS significantly outperforms all other closed-source LLMs in terms of faithfulness, while the open-source MIXTRAL is on par with GPT-3.5-TURBO. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.
The authors also implement several LLM-based raters of faithfulness, but find that none correlate strongly with human annotations, especially with regard to detecting unfaithful claims. This suggests that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding.
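To make the auto-rater setup concrete, below is a minimal sketch (not the authors' implementation) of claim-level faithfulness rating: an LLM judge labels each extracted summary claim against book evidence, and the verdicts are scored against human annotations. The `call_llm` callable, the prompt wording, and the evidence-retrieval step are all assumptions for illustration.

```python
# Sketch of an LLM-based faithfulness rater and its agreement with human labels.
# `call_llm` is a hypothetical stand-in for whatever model API is used.
from typing import Callable, List

PROMPT = (
    "You are given excerpts from a book and a claim made in a summary of that book.\n"
    "Decide whether the claim is supported by the excerpts.\n\n"
    "Excerpts:\n{evidence}\n\nClaim: {claim}\n\n"
    "Answer with exactly one word: FAITHFUL or UNFAITHFUL."
)

def rate_claim(call_llm: Callable[[str], str], claim: str, evidence: str) -> bool:
    """Return True if the LLM judges the claim faithful to the given evidence."""
    reply = call_llm(PROMPT.format(evidence=evidence, claim=claim))
    # Check for UNFAITHFUL first, since "FAITHFUL" is a substring of it.
    return "UNFAITHFUL" not in reply.upper()

def agreement(preds: List[bool], human: List[bool]) -> dict:
    """Overall accuracy plus recall on the unfaithful class, where the paper
    reports auto-raters struggling the most."""
    correct = sum(p == h for p, h in zip(preds, human))
    unfaithful = [p for p, h in zip(preds, human) if not h]
    detected = sum(not p for p in unfaithful)
    return {
        "accuracy": correct / len(human),
        "unfaithful_recall": detected / len(unfaithful) if unfaithful else float("nan"),
    }
```

Reporting recall on unfaithful claims separately mirrors the finding above: an auto-rater can score well on overall accuracy simply because most claims are faithful, while still missing the unfaithful ones that matter for evaluation.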
Beyond faithfulness, the study explores content selection errors in book-length summarization. The authors develop a typology of omission errors related to crucial narrative elements and identify a systematic over-emphasis on events occurring towards the end of the book.
Key insights from the paper by Yekyung Kim et al., published on arxiv.org, 04-02-2024: https://arxiv.org/pdf/2404.01261.pdf