Key Concepts
PROMETHEUS is an open-source large language model (LLM) that matches GPT-4's evaluation capabilities. The work emphasizes the importance of reference materials for fine-grained evaluation.
Summary
The article introduces PROMETHEUS, an open-source LLM designed for fine-grained evaluation. It addresses the drawbacks of relying on proprietary LLMs as evaluators: their closed-source nature, uncontrolled versioning, and prohibitive costs. The authors propose a new dataset, the FEEDBACK COLLECTION, consisting of score rubrics, instructions, responses, and feedback. Trained on this dataset, PROMETHEUS correlates highly with human evaluators, performing on par with GPT-4 and outperforming GPT-3.5-Turbo. Including reference materials such as score rubrics and reference answers is crucial for effective evaluation.
Abstract:
- Proprietary LLMs pose challenges for large-scale evaluation tasks.
- PROMETHEUS is an open-source LLM trained on the FEEDBACK COLLECTION dataset.
- Achieves high correlation with human evaluators and outperforms other models.
Introduction:
- Human evaluation remains essential in NLP.
- Automated metrics lack depth compared to human assessment.
- Using LLMs like GPT-4 as evaluators has gained attention but has limitations.
Data Extraction:
- "Experimental results show that PROMETHEUS scores a Pearson correlation of 0.897 with human evaluators."
- "Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks shows similar trends."
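The correlation figures quoted above are Pearson coefficients between an evaluator's scores and a reference set of scores (human or GPT-4). As a minimal sketch of how such a coefficient is computed, the snippet below implements the standard formula in plain Python. The score lists are made-up illustrative values, not data from the paper, and the function name `pearson` is an assumption for this example.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Illustrative 1-5 scores only (hypothetical, not from the paper).
human_scores = [1, 2, 3, 4, 5, 3, 2]
model_scores = [1, 3, 3, 4, 5, 2, 2]
print(f"Pearson r = {pearson(human_scores, model_scores):.3f}")
```

A value near 1 means the evaluator ranks and spaces responses much as the reference scorer does. A value near 0 means the two sets of scores are essentially unrelated.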
Statistics
Create the initial 50 seed rubrics.
Use GPT-4 to expand the initial 50 seed rubrics into 1,000 score rubrics.
Generate 20K unique instructions related to the new score rubrics.
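The three construction steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the authors' code. The GPT-4 calls are replaced by hypothetical placeholder functions (`fake_paraphrase`, `fake_instruction`). The 1,000 rubrics are assumed to include the 50 seeds. The 20K instructions are assumed to be split evenly at 20 per rubric, which follows from the totals but is not stated in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rubric:
    rubric_id: int
    criteria: str


@dataclass(frozen=True)
class Instruction:
    rubric_id: int
    text: str


def expand_rubrics(
    seed_criteria: List[str],
    target: int,
    paraphrase: Callable[[str, int], str],
) -> List[Rubric]:
    """Grow the seed rubrics to `target` rubrics by cycling over the seeds."""
    if not seed_criteria:
        raise ValueError("at least one seed rubric is required")
    criteria = list(seed_criteria)
    i = 0
    while len(criteria) < target:
        seed = seed_criteria[i % len(seed_criteria)]
        criteria.append(paraphrase(seed, i))
        i += 1
    return [Rubric(rid, text) for rid, text in enumerate(criteria[:target])]


def generate_instructions(
    rubrics: List[Rubric],
    per_rubric: int,
    write_instruction: Callable[[Rubric, int], str],
) -> List[Instruction]:
    """Produce `per_rubric` instructions for every rubric."""
    return [
        Instruction(r.rubric_id, write_instruction(r, k))
        for r in rubrics
        for k in range(per_rubric)
    ]


# Hypothetical stand-ins for the GPT-4 generation steps.
def fake_paraphrase(seed: str, i: int) -> str:
    return f"{seed} (variant {i})"


def fake_instruction(rubric: Rubric, k: int) -> str:
    return f"Instruction {k} for rubric {rubric.rubric_id}"


seeds = [f"Seed criterion {i}" for i in range(50)]
rubrics = expand_rubrics(seeds, 1000, fake_paraphrase)
instructions = generate_instructions(rubrics, 20, fake_instruction)
print(len(rubrics), len(instructions))
```

Passing the generators in as callables keeps the counting logic testable without a model call. A real pipeline would swap the placeholders for actual LLM requests and would need its own check for duplicate instructions.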
Quotes
"Applying LLMs (e.g., GPT-4) as an evaluator has received substantial attention due to its potential parity with human evaluation."
"However, while the merits of using proprietary LLMs as an evaluation tool are evident, there exist some critical disadvantages."