Core Concepts
Reference-free metrics correlate more strongly with human judgment and are more sensitive to language-quality deficiencies than reference-based metrics.
Abstract
The paper examines whether references are necessary for evaluating NLG systems, comparing reference-based and reference-free metrics. It investigates when and where reference-free metrics are effective, highlighting their correlation with human judgment and their sensitivity to language-quality issues. The study offers insights into metric performance across tasks, datasets, and evaluation models.
Directory:
Abstract
Automatic metrics for evaluating NLG systems are predominantly reference-based.
Challenges in collecting human annotations lead to interest in reference-free metrics.
Introduction
Automatic evaluation metrics play a crucial role in NLG development.
Preliminary
Criteria like coherence, consistency, and fluency are defined for evaluation.
Experiments
Performance of metrics on different datasets and criteria is evaluated.
Perturbation Experiments
Perturbation tests reveal the ability of metrics to detect text defects.
Kolmogorov-Smirnov Test
KS scores show the capability of metrics to distinguish high-quality from low-quality texts.
Stability Analysis
Meta-correlation analysis explores metric stability with varying system quality.
Conclusion
Recommendations on utilizing automatic metrics effectively.
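The perturbation tests listed above can be illustrated with a minimal sketch: apply a controlled defect to a text and check that the metric's score drops. The overlap metric and word-dropping perturbation below are toy stand-ins, not the metrics or perturbations used in the study.

```python
import random

def token_f1(candidate: str, reference: str) -> float:
    """Toy overlap metric: token-level F1 between candidate and reference."""
    cand, ref = set(candidate.split()), set(reference.split())
    common = len(cand & ref)
    if common == 0:
        return 0.0
    p, r = common / len(cand), common / len(ref)
    return 2 * p * r / (p + r)

def drop_words(text: str, k: int, seed: int = 0) -> str:
    """Perturbation: randomly delete k words to inject a coverage defect."""
    words = text.split()
    rng = random.Random(seed)
    for _ in range(min(k, len(words) - 1)):
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

reference = "the quick brown fox jumps over the lazy dog"
candidate = "the quick brown fox jumps over the lazy dog"
perturbed = drop_words(candidate, k=3)

# A metric that detects the defect should score the perturbed text lower.
assert token_f1(perturbed, reference) < token_f1(candidate, reference)
```

A metric whose score barely moves under such perturbations is, by this test, insensitive to that class of defect.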
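The Kolmogorov-Smirnov test mentioned above compares the distribution of metric scores on high-quality texts against the distribution on low-quality texts; a larger KS statistic means the metric separates the two groups better. A minimal pure-Python sketch of the two-sample statistic, with made-up scores for illustration:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Hypothetical metric scores for texts rated high vs. low quality by humans.
high = [0.81, 0.77, 0.90, 0.85, 0.88, 0.79]
low = [0.40, 0.55, 0.35, 0.50, 0.45, 0.60]

print(ks_statistic(high, low))  # 1.0: the two score distributions do not overlap
```

In practice `scipy.stats.ks_2samp` computes the same statistic along with a p-value; the hand-rolled version here only serves to show what the score measures.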
Stats
The majority of automatic metrics for evaluating NLG systems are reference-based.
Recent advancements have led to interest in reference-free metrics due to challenges in collecting human annotations.