Core Concepts
Large language models can technically summarize book-length documents, but the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. This study conducts the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books.
Abstract
The authors present FABLES, a dataset of human annotations on the faithfulness and content selection of LLM-generated summaries of 26 recently published fictional books. They hired annotators who had fully read each book prior to the annotation task to mitigate the issue of data contamination.
The study finds that CLAUDE-3-OPUS significantly outperforms all other closed-source LLMs on faithfulness, while the open-source MIXTRAL is on par with GPT-3.5-TURBO. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and that invalidating them generally requires indirect reasoning over the narrative.
The authors also implement several LLM-based raters of faithfulness, but find that none correlate strongly with human annotations, especially with regard to detecting unfaithful claims. This suggests that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding.
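To make the evaluation setup concrete, the sketch below frames an LLM-based claim rater in the simplest possible way: given a claim from a summary and an excerpt from the book, ask a model for a faithful/unfaithful verdict, then measure agreement with human labels. The prompt wording, model name, and data format are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of an LLM-based claim-level faithfulness rater.
# The prompt, model name, and agreement metric are illustrative assumptions,
# not the FABLES authors' setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rate_claim(claim: str, evidence: str, model: str = "gpt-4o") -> bool:
    """Return True if the model judges the claim faithful to the evidence."""
    prompt = (
        "You are verifying a claim extracted from a book summary.\n"
        f"Book excerpt:\n{evidence}\n\n"
        f"Claim: {claim}\n\n"
        "Answer with exactly one word: FAITHFUL or UNFAITHFUL."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("FAITHFUL")


def agreement(predictions: list[bool], human_labels: list[bool]) -> float:
    """Fraction of claims where the LLM rater matches the human annotation."""
    return sum(p == h for p, h in zip(predictions, human_labels)) / len(human_labels)
```

The study's finding is that raters of this general shape agree with humans far more often on faithful claims than on unfaithful ones, which is why unfaithful-claim detection is singled out as the open problem.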
Beyond faithfulness, the study explores content selection errors in book-length summarization. The authors develop a typology of omission errors related to crucial narrative elements and identify a systematic over-emphasis on events occurring towards the end of the book.
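One way to surface the end-of-book skew described above is to bucket summary content by where its supporting evidence falls in the source text. The sketch below assumes each claim has already been aligned to a character offset in the book; the alignment step itself is not shown and is not part of the released annotations.

```python
# Sketch: histogram of where in the book summarized content comes from.
# Assumes claim-to-source alignment (offsets) is available; that step is
# an assumption for illustration.
from collections import Counter


def position_histogram(claim_offsets: list[int], book_length: int, bins: int = 10) -> Counter:
    """Bucket claims by the relative position (decile 0-9) of their evidence."""
    hist = Counter()
    for offset in claim_offsets:
        decile = min(int(offset / book_length * bins), bins - 1)
        hist[decile] += 1
    return hist
```

A strong skew toward the last deciles would reflect the systematic over-emphasis on late-book events that the study reports.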
Stats
The mean length of books in the dataset is 121,467 tokens.
The dataset contains 3,158 claim-level annotations across 26 books.
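For reference, token counts such as the mean book length above depend on the tokenizer used. A minimal sketch with tiktoken follows; the cl100k_base encoding and plain-text file layout are assumptions, so the exact figure of 121,467 tokens will only be reproduced with the tokenizer the authors used.

```python
# Sketch: compute the mean token length of a directory of plain-text books.
# The encoding and *.txt layout are assumptions for illustration.
from pathlib import Path

import tiktoken


def mean_token_length(book_dir: str, encoding_name: str = "cl100k_base") -> float:
    enc = tiktoken.get_encoding(encoding_name)
    lengths = [
        len(enc.encode(path.read_text(encoding="utf-8")))
        for path in Path(book_dir).glob("*.txt")
    ]
    return sum(lengths) / len(lengths)
```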