Evaluating Faithfulness and Content Selection in Book-Length Summarization by Large Language Models
Large language models can technically summarize book-length documents, but the length and complexity of those documents have so far precluded evaluation of input-dependent aspects such as faithfulness. This study conducts the first large-scale human evaluation of faithfulness and content selection in LLM-generated summaries of fictional books.