The CREAM framework introduces a novel approach to automatically evaluating meeting summarization models. It addresses the limitations of existing LLM-based evaluators, which struggle with accurately assessing completeness and conciseness for long-context dialogue summarization tasks.
Key highlights:
Experiments show that current LLM-based evaluators often provide inaccurate scores for meeting summarization, exhibiting high self-bias and weak correlation with human judgments.
CREAM utilizes a two-step process facilitated by a Chain-of-Thought (CoT) prompt. First, it extracts a set of concise key facts from the concatenated summaries. Then, it compares these key facts to each summary to assess completeness and conciseness.
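The key-fact comparison step can be sketched as follows. This is a hypothetical illustration, not CREAM's actual prompts or scoring rules: the `entails` predicate stands in for an LLM judgment of whether a summary sentence conveys a key fact, and in the demo it is mocked with simple substring matching.

```python
def score_summary(key_facts, summary_sentences, entails):
    """Score one summary against a shared key-fact set.

    entails(fact, sentence) -> bool is a stand-in for an LLM judgment
    that the sentence conveys the fact.
    """
    # Completeness: fraction of key facts covered by the summary.
    covered = [f for f in key_facts
               if any(entails(f, s) for s in summary_sentences)]
    # Conciseness: fraction of summary sentences grounded in a key fact.
    grounded = [s for s in summary_sentences
                if any(entails(f, s) for f in key_facts)]
    completeness = len(covered) / len(key_facts) if key_facts else 0.0
    conciseness = len(grounded) / len(summary_sentences) if summary_sentences else 0.0
    return completeness, conciseness


# Toy demo: substring matching mocks the entailment judgment.
facts = ["budget was approved", "launch moved to june"]
sentences = ["the budget was approved.", "the weather was discussed."]
comp, conc = score_summary(facts, sentences, lambda f, s: f in s)
print(comp, conc)  # 0.5 0.5
```

Here the summary covers one of two key facts (completeness 0.5), and one of its two sentences is grounded in a key fact (conciseness 0.5); in CREAM the same per-summary signals feed the pairwise comparisons.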
CREAM employs an Elo ranking system to systematically compare model performance based on the comparison-based scores, providing a robust mechanism for ranking summarization models.
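The Elo mechanism itself is standard; a minimal sketch, assuming the usual logistic expected score and a fixed K-factor of 32 (the summary does not specify CREAM's actual parameters):

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update two Elo ratings after one pairwise comparison.

    score_a is 1.0 if model A's summary wins, 0.0 if it loses,
    and 0.5 for a tie.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two models start at 1000; model A wins one comparison.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(a, b)  # 1016.0 984.0
```

Repeating this update over many pairwise comparison outcomes yields a stable ranking of the summarization models.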
Evaluation on public and private datasets demonstrates that CREAM outperforms prior baselines, achieving a perfect correlation with human preferences (Pearson's r of 1.0) for both completeness and conciseness.
The framework's adaptability allows for customization, enabling users to tailor the evaluation criteria to specific needs, such as emphasizing aspects most relevant to the intended audience or application.
Key insights distilled from source content by Ziwei Gong, ... at arxiv.org, 09-18-2024
https://arxiv.org/pdf/2409.10883.pdf