Leveraging Large Language Models to Diversify References and Improve Natural Language Generation Evaluation
Increasing the number of references in NLG benchmarks can significantly improve the correlation between automatic evaluation metrics and human judgments.
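To illustrate why extra references can matter, here is a minimal sketch of multi-reference scoring. It rests on assumptions not taken from the source: the metric is a simple unigram-overlap F1 standing in for a real NLG metric, the per-reference scores are aggregated with a maximum, and the example sentences and the names `token_f1` and `multi_ref_score` are invented for illustration.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings.

    A stand-in metric chosen for brevity, not the metric used in the source.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_ref_score(hypothesis: str, references: list[str]) -> float:
    """Score the hypothesis against every reference and keep the best match.

    Max aggregation is an assumption; other schemes (e.g. averaging) exist.
    """
    return max(token_f1(hypothesis, r) for r in references)


# Hypothetical example: a valid paraphrase of the single reference.
hypothesis = "the film was great"
single = ["the movie was excellent"]
diversified = single + ["the film was great fun"]

print(multi_ref_score(hypothesis, single))
print(multi_ref_score(hypothesis, diversified))
```

With max aggregation, adding a reference can never lower a hypothesis's score, so an acceptable paraphrase is less likely to be penalized merely for differing in wording from the one available reference.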