This paper presents AutoNuggetizer, a fully automated evaluation framework for Retrieval-Augmented Generation (RAG) systems that uses large language models (LLMs) to create information "nuggets" and assign them to system-generated answers as a measure of answer quality. Initial results from the TREC 2024 RAG Track show a strong correlation between this automated approach and manual evaluation by human assessors, suggesting it is a reliable and efficient alternative for evaluating and iterating on RAG systems.
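As a rough illustration of the nugget-based scoring idea (not the track's official formula), the sketch below scores an answer from LLM-assigned nugget labels; the label names and weights are assumptions made for this example.

```python
# Illustrative sketch of nugget-based answer scoring.
# Label names and weights (1.0 / 0.5 / 0.0) are assumptions, not the official TREC formula.

def nugget_score(assignments: dict[str, str]) -> float:
    """Score an answer from nugget assignment labels.

    `assignments` maps each nugget to a label produced by an LLM judge:
    'support', 'partial_support', or 'not_support'.
    """
    weights = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}
    if not assignments:
        return 0.0
    return sum(weights[label] for label in assignments.values()) / len(assignments)


# Example: an answer fully covering one nugget, partially covering another, missing a third.
labels = {
    "nugget_1": "support",
    "nugget_2": "partial_support",
    "nugget_3": "not_support",
}
print(nugget_score(labels))  # 0.5
```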
This paper introduces LONG2RAG, a benchmark for evaluating how well large language models (LLMs) use retrieved information when generating long-form answers, together with Key Point Recall (KPR), a metric that measures how many of the key points in the retrieved documents a model incorporates into its response.
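A minimal sketch of how a KPR-style score could be computed, assuming key points have already been extracted from the retrieved documents; the `is_covered` helper is a hypothetical stand-in for the paper's LLM-based coverage judgment.

```python
# Sketch of a Key Point Recall (KPR)-style metric: the fraction of extracted
# key points that the long-form response covers.

def key_point_recall(key_points: list[str], response: str) -> float:
    """Fraction of key points covered by the response."""
    if not key_points:
        return 0.0
    covered = sum(1 for kp in key_points if is_covered(kp, response))
    return covered / len(key_points)


def is_covered(key_point: str, response: str) -> bool:
    # Placeholder: naive substring matching standing in for an LLM judge.
    return key_point.lower() in response.lower()


key_points = ["signed in 1992", "entered into force in 1994"]
response = "The treaty was signed in 1992 and took effect two years later."
# The second key point is paraphrased, so this naive check misses it;
# an LLM-based judge would be needed to credit it.
print(key_point_recall(key_points, response))  # 0.5
```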
RAGEval is a novel framework for generating scenario-specific datasets to evaluate the performance of Retrieval-Augmented Generation (RAG) systems, focusing on factual accuracy and scenario-specific knowledge.
CoFE-RAG is an evaluation framework designed to comprehensively assess every stage of the Retrieval-Augmented Generation (RAG) pipeline. By introducing multi-granularity keywords and a benchmark dataset covering diverse data scenarios, it addresses the limitation of existing evaluation methods that rely on golden chunk annotations.
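In the spirit of CoFE-RAG's keyword-based evaluation, the sketch below scores retrieval by matching multi-granularity keywords against retrieved chunks instead of golden chunk labels; the matching rule and names are illustrative assumptions, not the framework's exact procedure.

```python
# Hedged sketch of keyword-based retrieval scoring: coarse- and fine-grained
# keyword sets are matched against retrieved chunks, avoiding golden chunk labels.

def keyword_recall(keywords: list[str], retrieved_chunks: list[str]) -> float:
    """Fraction of keywords that appear in at least one retrieved chunk."""
    if not keywords:
        return 0.0
    text = " ".join(retrieved_chunks).lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


coarse = ["climate policy"]                      # broad topic keywords
fine = ["carbon tax", "2030 emissions target"]   # fine-grained fact keywords
chunks = ["New climate policy: the 2030 emissions target requires a carbon tax."]
print(keyword_recall(coarse, chunks), keyword_recall(fine, chunks))  # 1.0 1.0
```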
FRAMES, a novel evaluation dataset, comprehensively tests the factuality, retrieval, and reasoning capabilities of retrieval-augmented generation (RAG) systems in a unified framework.
ARES is an automated framework for evaluating retrieval-augmented generation (RAG) systems along the dimensions of context relevance, answer faithfulness, and answer relevance.
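A hedged sketch of scoring a single (query, context, answer) triple along ARES's three dimensions; the `judge` callable stands in for ARES's trained LM judges, and the data structure and function names here are assumptions for illustration.

```python
# Sketch of ARES-style per-example scoring along its three evaluation dimensions.

from dataclasses import dataclass
from typing import Callable


@dataclass
class RAGExample:
    query: str
    retrieved_context: str
    answer: str


def score_example(example: RAGExample,
                  judge: Callable[[str, RAGExample], bool]) -> dict[str, bool]:
    """Apply a binary judge to each evaluation dimension."""
    return {
        "context_relevance": judge("Is the context relevant to the query?", example),
        "answer_faithfulness": judge("Is the answer grounded in the context?", example),
        "answer_relevance": judge("Does the answer address the query?", example),
    }


# Trivial stand-in judge (always "yes"); a real setup would use trained LM judges.
scores = score_example(
    RAGExample(query="q", retrieved_context="c", answer="a"),
    judge=lambda question, ex: True,
)
print(scores)
```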