The paper introduces COSMIC, a novel evaluation metric for summarizers based on mutual information. It addresses challenges in assessing summarizer quality and shows strong correlations with metrics derived from human judgments. The approach is theoretically grounded and offers insight into how useful summaries are for downstream tasks.
Standard automatic evaluation methods such as BLEU and ROUGE often align poorly with human judgments, and recent work has turned to learned metrics to score summaries more accurately. The proposed approach instead departs from scoring individual outputs against references.
The study introduces two modifications to the conventional paradigm of scoring individual output summaries: it evaluates the probability distribution of summaries induced by a summarizer rather than single outputs, and it adopts a task-oriented evaluation setup in which summaries must allow agents to perform downstream tasks without reading the source text.
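To make the task-oriented setup concrete, the following is a minimal, hypothetical sketch rather than the paper's actual protocol: an agent solves a downstream classification task from summaries alone, and its performance is compared with an agent that reads the full source texts. The toy data, the sentence-transformers encoder, and the scikit-learn linear probe are all illustrative assumptions.

```python
# Hypothetical illustration of task-oriented summary evaluation: compare how
# well an agent can solve a downstream task (here, toy sentiment labels)
# from summaries alone versus from the full source texts.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder corpus; a real experiment would use a labelled dataset and the
# summaries produced for it by the summarizer under evaluation.
sources = [
    "The film was a long, tedious mess with wooden acting throughout.",
    "A dazzling, heartfelt story that kept the audience smiling.",
    "The plot collapsed halfway through and never recovered.",
    "Gorgeous cinematography and a script full of warmth and wit.",
    "Overlong, overloud, and utterly forgettable.",
    "An instant classic with a career-best lead performance.",
]
summaries = [
    "Tedious film, wooden acting.",
    "Dazzling, heartfelt story.",
    "Plot collapses midway.",
    "Gorgeous visuals, witty script.",
    "Overlong and forgettable.",
    "Instant classic, great lead.",
]
labels = [0, 1, 0, 1, 0, 1]  # downstream task: sentiment of the source text

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def downstream_score(texts):
    """Cross-validated accuracy of a linear probe trained on text embeddings."""
    X = encoder.encode(texts)
    return cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=3).mean()

print(f"agent reading sources:   {downstream_score(sources):.2f}")
print(f"agent reading summaries: {downstream_score(summaries):.2f}")
```

The closer the summary-only score is to the source score, the more task-relevant information the summaries retain.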
The research establishes an information-theoretic rationale for the metric: summarizer quality is tied to the mutual information (MI) between the distribution of source texts and the distribution of generated summaries. Quality assessment is framed as a statistical inference problem, from which a reference-free quality metric based on MI is derived.
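As a rough illustration of the quantity involved, the sketch below estimates mutual information under a joint-Gaussian assumption over embedding-like vectors. This is only a stand-in, not COSMIC's actual estimator, and the synthetic "embeddings" are invented for the example; the point is that a summarizer whose outputs track the source more closely yields a higher estimated MI.

```python
# Toy Gaussian-approximation estimate of I(source; summary) over embeddings.
import numpy as np

def gaussian_mi(x: np.ndarray, y: np.ndarray, eps: float = 1e-6) -> float:
    """MI in nats under a joint-Gaussian assumption:
    I(X; Y) = 0.5 * (log det Sigma_X + log det Sigma_Y - log det Sigma_[X,Y])."""
    dx = x.shape[1]
    joint = np.concatenate([x, y], axis=1)
    cov = np.cov(joint, rowvar=False) + eps * np.eye(joint.shape[1])  # regularized
    _, logdet_joint = np.linalg.slogdet(cov)
    _, logdet_x = np.linalg.slogdet(cov[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(cov[dx:, dx:])
    return 0.5 * (logdet_x + logdet_y - logdet_joint)

# Synthetic stand-ins for (source embedding, summary embedding) pairs.
rng = np.random.default_rng(0)
n, d = 5000, 8
src = rng.normal(size=(n, d))                # "source" embeddings
good = src + 0.3 * rng.normal(size=(n, d))   # summaries that track the source
bad = 0.1 * src + rng.normal(size=(n, d))    # summaries that are mostly noise

print(f"MI(source, informative summarizer):   {gaussian_mi(src, good):.2f} nats")
print(f"MI(source, uninformative summarizer): {gaussian_mi(src, bad):.2f} nats")
```

In practice, MI over text embeddings is estimated with dedicated neural estimators rather than a Gaussian closed form; the sketch only conveys why higher MI corresponds to summaries that preserve more information about their sources.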
Contributions include framing summarizer evaluation as a statistical inference problem, proposing COSMIC as a practical implementation, and experimentally evaluating MI's predictive performance compared to conventional metrics like BERTScore and BARTScore.
Source: by Maxime Darri..., arxiv.org, 03-01-2024. https://arxiv.org/pdf/2402.19457.pdf