Leiter, C., & Eger, S. (2024). PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation. arXiv preprint, arXiv:2406.18528v2.
This paper investigates the effectiveness and robustness of open-source large language models (LLMs) as evaluation metrics for machine translation (MT) and summarization tasks. The research aims to understand the impact of different prompting strategies on the performance of these LLM-based metrics.
The authors conducted a large-scale experiment, PrExMe (Prompt Exploration for Metrics), evaluating over 720 prompt templates across seven open-source LLMs. They tested these prompts on MT and summarization datasets, totaling more than 6.6 million evaluations. The study used a hierarchical template design incorporating techniques such as zero-shot prompting, chain-of-thought (CoT) prompting, and retrieval-augmented generation (RAG). LLM-generated scores were compared against human judgments using Kendall, Pearson, and Spearman correlations, along with tie-calibrated accuracy.
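As a minimal sketch of this meta-evaluation step, the snippet below correlates segment-level LLM-assigned scores with human judgments using SciPy; the variable names and example values are hypothetical, and tie-calibrated accuracy is omitted here.

```python
# Sketch: correlate LLM-assigned scores with human judgments for one
# dataset / prompt template. Example values are illustrative only.
from scipy.stats import kendalltau, pearsonr, spearmanr

llm_scores = [4.0, 2.5, 3.0, 5.0, 1.0, 3.5]     # hypothetical LLM outputs
human_scores = [3.8, 2.0, 3.2, 4.9, 1.5, 3.0]   # hypothetical human ratings

tau, _ = kendalltau(llm_scores, human_scores)   # rank correlation (ties-aware)
rho, _ = spearmanr(llm_scores, human_scores)    # rank correlation
r, _ = pearsonr(llm_scores, human_scores)       # linear correlation

print(f"Kendall tau: {tau:.3f}, Spearman rho: {rho:.3f}, Pearson r: {r:.3f}")
```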
The research highlights the significant impact of prompt engineering on the performance of LLM-based evaluation metrics. It underscores the need for careful prompt design and selection to ensure robust and reliable evaluation results. The authors suggest that understanding model-specific prompt preferences and leveraging median performance across various prompting patterns can guide the development of more effective LLM-based metrics.
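To illustrate the idea of favoring median performance over peak performance, here is a small sketch of ranking prompt templates by their median correlation across datasets; the template names, datasets, and correlation values are hypothetical, and the paper's actual selection procedure may differ in detail.

```python
# Sketch: pick a robust prompt template by median correlation across datasets.
# Template names, dataset keys, and values are hypothetical.
from statistics import median

# correlations[template][dataset] = correlation with human judgments
correlations = {
    "zero-shot":        {"mt_zh-en": 0.28, "mt_en-de": 0.31, "summ": 0.25},
    "chain-of-thought": {"mt_zh-en": 0.33, "mt_en-de": 0.27, "summ": 0.30},
    "rag-augmented":    {"mt_zh-en": 0.30, "mt_en-de": 0.29, "summ": 0.26},
}

# Rank templates by median correlation, rewarding consistency across datasets
# rather than a single strong result on one benchmark.
ranked = sorted(
    correlations.items(),
    key=lambda item: median(item[1].values()),
    reverse=True,
)

for name, per_dataset in ranked:
    print(f"{name}: median correlation = {median(per_dataset.values()):.3f}")
```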
This study provides valuable insights into the capabilities and limitations of open-source LLMs for evaluating text generation tasks. It emphasizes the importance of robust prompt engineering practices in leveraging LLMs as reliable evaluation metrics, particularly in scenarios where fine-tuning or access to closed-source models is restricted.
The authors acknowledge limitations in the scope of explored prompting approaches and the potential for data leakage during LLM training. Future research could explore a wider range of prompting techniques, investigate the impact of prompt complexity on different LLMs, and address potential biases in LLM-based evaluation.